AU2020102038A4 - A speaker identification method based on deep learning - Google Patents

A speaker identification method based on deep learning

Info

Publication number
AU2020102038A4
Authority
AU
Australia
Prior art keywords
data
training
speaker
deep learning
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020102038A
Inventor
Yichen Jia
Jinyi Liu
Yanan WU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liu Jinyi Miss
Wu Yanan Miss
Original Assignee
Liu Jinyi Miss
Wu Yanan Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liu Jinyi Miss, Wu Yanan Miss filed Critical Liu Jinyi Miss
Priority to AU2020102038A priority Critical patent/AU2020102038A4/en
Application granted granted Critical
Publication of AU2020102038A4 publication Critical patent/AU2020102038A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)

Abstract

This invention lies in the field of digital audio processing and provides a speech recognition system for identifying different speakers based on deep learning. The invention consists of the following steps. First, sufficient data was prepared and split into training and testing sets. Secondly, we preprocessed the data using Voice Activity Detection (VAD) to detect the effective audio segments and Mel-frequency cepstral coefficients (MFCC) for feature extraction. Then, the training data were fed in batches into the Convolutional Neural Network (CNN) that we had designed. Simultaneously, the parameters of the CNN, namely the dropout rate, base learning rate and loss rate, were adjusted in order to optimize the performance of the model. Eventually, the optimal CNN can be applied to the testing data, and speaker identities can be recognized with an accuracy of 92.6%. In brief, with this invention the identity of the speaker can be recognized automatically without human involvement.

Description

Figure 1: Raw Data → Mono conversion → Voice Activity Detection → Audio Segment → transform to .mat format → Training Data
TITLE A speaker identification method based on deep learning
FIELD OF THE INVENTION This invention lies in the field of digital audio processing and serves to recognize speakers' identities using deep learning.
BACKGROUND OF THE INVENTION Human speech is produced by a complex biophysical process of interactions between the human speech center and the vocal organs. Vocal organs such as the tongue, teeth, throat, lungs and nose vary in size and shape from person to person, so the sonic map of any two individuals will show some variation. Like other biometric authentication technologies, speech recognition has the advantages of being unforgettable, requiring no memorization and being easy to use. In the field of biometric authentication technology, speech recognition has received widespread attention for its unique convenience, economy and accuracy, and it is becoming an important and popular security authentication method in people's daily life and work. In the past two decades, speech recognition technology has made significant progress and is beginning to move from the lab to the marketplace. More specifically, Siri, developed by Apple, and Alexa, developed by Amazon, are becoming prevalent in people's daily lives, and both have implemented speaker recognition so that they can recognize the identity of the speaker.
In terms of conventional automatic speech recognition methods, effective and rapid adaptation to specific accents and corruptions is still an issue that needs to be solved. To address this problem, deep learning is utilized for automatic speech recognition. Overfitting to the training data, however, then becomes a problem to be solved. Furthermore, naively trained Deep Neural Networks are usually treated as "black boxes" and are difficult to interpret directly because of their highly distributed representations.
From a general view, speaker recognition can be divided into three categories according to the application: speaker identification, speaker verification, and speaker diarization. The three phases of speaker verification can be briefly described as the training phase, the enrollment phase, and the evaluation phase. In addition, the type of speaker verification system depends on the type of data used for enrollment and recognition, namely text-dependent and text-independent modes of operation. From a technical perspective, the characteristics or features in the audio that differ among speakers, and that are used for speaker authentication, surveillance, and forensics, can be analyzed by speech recognition. Specifically, the process of checking the authenticity of a speaker can be described as a comparison between the speaker's speech and the template speech patterns of the many speakers already enrolled in the system.
The general technique for Voice Activity Detection (VAD) is dual-threshold endpoint detection based on short-time energy and zero-crossing rate. By applying this technique, the effective speech segments can be recognized and detected. Mel-Frequency Cepstral Coefficients (MFCC) are a widely used feature in automatic speech recognition, developed by Davis and Mermelstein in 1980. The process of extracting MFCC features can be summarized in seven phases: pre-emphasis, framing and windowing, fast Fourier transform, taking the absolute or squared value, Mel filtering, taking the logarithm, and finally Delta MFCC. The Deep Neural Network (DNN) was first introduced in the 1960s, but the first practical application, LeNet for recognizing handwritten numbers, did not appear until the late 1980s. The breakthrough in DNN-based speech recognition, however, was not made until 2011.
The great success of deep learning around 2010 is mainly due to three factors: the vast amount of data required to train the networks, the availability of sufficient computing resources, and the evolution of algorithms, which have greatly improved the accuracy and broadened the scope of applications of DNNs. Technically, DNNs introduce multiple layers of non-linear processing units, which allow complex data to be modelled well, while representations in a multi-layer configuration can be yielded automatically from the raw data representation. Therefore, signal spectrograms can be used directly as raw features in a general-purpose learning algorithm. In this invention, we concentrate on recognizing the identity of the speaker by using a text-independent speech recognition system based on deep learning. To solve the previously mentioned issues in traditional speech recognition systems, we plan to optimize the architecture through regularization, dropout and the number of iterations. Additionally, TensorFlow will be used in this invention as the deep learning framework for implementing the model, since it is an end-to-end open source machine learning platform with a comprehensive and flexible ecosystem of tools, libraries and community resources.
SUMMARY OF THE INVENTION In order to implement automatic audio feature extraction and solve the shortcomings and deficiencies of the existing technology, this invention proposes a speech recognition method for quickly identifying the identity of the speaker based on deep learning. From an overall perspective, the procedure of developing our system can be divided into five parts: data processing, architecture design, architecture training, optimization and testing. Our deep learning speech recognition invention also consists of several components, which are the speech segment database, convolutional neural networks, parameter optimization and the implementation of speech recognition.
Data processing The speech segment database was established by collecting our voice records. Then, we preprocessed all the audio segments by converting them to mono and splitting them into segments of identical length, about seven or eleven seconds long. After this separation, the audio segments are divided into a training data set and a testing data set. Meanwhile, Voice Activity Detection (VAD) is applied to all the segments to detect the valid fragments, or more specifically, to check the start point and the end point of each segment. In this way, the fragments that contain voice are detected and split out for feature extraction, which is done by applying Mel-frequency cepstral coefficients (MFCC). Eventually, the convolutional neural network is implemented according to our design, with the extracted features as its input.
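As an illustration only, the following Python sketch shows how the preprocessing flow of Figure 1 (mono conversion, a simple energy-based VAD standing in for the dual-threshold method, MFCC extraction and export to .mat) could be implemented. The librosa and scipy calls, the 16 kHz sample rate and the 26 MFCC coefficients are assumptions for the example, not the exact code of the invention.

```python
import numpy as np
import librosa
from scipy.io import savemat

def simple_vad(y, frame_len=400, hop=160, energy_ratio=0.1):
    """Crude energy-based VAD: keep frames whose short-time energy exceeds
    a fraction of the mean energy (a stand-in for the dual-threshold method)."""
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    energy = (frames ** 2).sum(axis=0)
    mask = energy > energy_ratio * energy.mean()
    keep = np.repeat(mask, hop)[: len(y)]
    return y[: len(keep)][keep]

def preprocess(path, sr=16000, n_mfcc=26):
    y, _ = librosa.load(path, sr=sr, mono=True)              # mono conversion
    y = simple_vad(y)                                         # voice activity detection
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, n_frames)
    return mfcc.T                                             # frames x coefficients

# Example usage (hypothetical file names):
# savemat("segment_001.mat", {"mfcc": preprocess("segment_001.wav")})
```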
Architecture design Deep learning is applied in this speech recognition system; therefore, the architecture of the system is designed with three convolutional layers followed by two fully connected layers, as shown in Figure 4. To take full advantage of automatic feature extraction in deep learning, a layer-by-layer initializing training mode was adopted. As a result, training efficiency can be improved and technical difficulties, for example overfitting, can be overcome during the training phase. The difficulties of audio feature extraction and automatic audio recognition are solved by applying this method.
Convolutional neural network (CNN) A basic Convolutional Neural Network (CNN) consists of three structures: convolution, activation and pooling. The CNN of this invention is a sequence of layers, specifically three convolutional layers. A convolutional layer consists of several convolutional units; the first convolutional layer may only be able to extract some low-level features, so that the following layers can iteratively extract more complex features from those low-level features. The most common operation in a CNN is convolution, which uses a convolutional kernel to extract features from the input image. A convolutional kernel is a matrix that slides over the image in a sliding-window fashion and is multiplied with the input to enhance the output as expected. The output of the system at a certain time is the result of the interaction of multiple inputs, and the size of the output data can be calculated with formula (1) shown below, where W is the size of the input unit, F is the receptive field, S is the stride, and P is the zero-padding.
Output size = (W - F + 2P)/S + 1    (1)
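For illustration, a small Python helper applying formula (1); the example values in the comment correspond to the output shapes listed later in the layer table.

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size of a convolution, per formula (1)."""
    return (W - F + 2 * P) // S + 1

# A 32-wide input with a 3x3 kernel, stride 1 and no padding gives 30,
# matching the (None, 30, 30, 32) shape listed later in the layer table.
print(conv_output_size(32, 3, S=1, P=0))  # 30
```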
Fully connected neural network Besides the convolutional layers, the fully connected layers play the role of a "classifier" in the entire neural network. Unlike in the convolutional layers, each node in a fully connected layer is connected to all the nodes of the previous layer; it is used to synthesize the features extracted previously and to map the input image to the label set, that is, to classify it.
Pooling Pooling is a subsampling operation whose primary goal is to reduce the feature space of the feature map or, in other words, to reduce its resolution. Since the feature map contains a large number of parameters, retaining every detail is not beneficial for extracting high-level features. The pooling layers in our invention are sandwiched between successive convolutional layers and are used to compress the amount of data and parameters and to reduce overfitting.
Activation function After convolution, a bias is usually added and a non-linear activation function is introduced, where the bias is denoted b and the activation function is h(·). The result after activation is given in formula (2).
z = h(Σ_i W_i v_i + b)    (2)
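A toy numerical illustration of formula (2), with a ReLU chosen as the activation h (the choice of ReLU and the numbers are assumptions made for the example):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

W = np.array([[0.2, -0.5, 0.1],
              [0.4,  0.3, -0.2]])   # 2 output units, 3 inputs
v = np.array([1.0, 2.0, -1.0])      # input features
b = np.array([0.05, -0.1])          # bias per output unit

z = relu(W @ v + b)                  # formula (2): z = h(W v + b)
print(z)                             # approximately [0.  1.1]
```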
Architecture training A convolutional neural network is essentially an input-output mapping. It can learn a large number of mapping relationships between inputs and outputs without requiring any exact mathematical expressions between them. As long as the convolutional neural network is trained on a specific pattern, the network acquires the ability to map between other inputs and outputs. The training of our convolutional neural network can be summarized in two phases: a forward propagation phase and a backward propagation phase. In the first phase, a sample is taken from the sample set and fed into the network, and the corresponding actual output is calculated; in this phase, data is transferred from the input layer to the output layer through step-by-step transformations. In the second phase, backward propagation, the error between the actual output and the corresponding ideal output is calculated, and then the gradient of each weight is computed. Eventually, the weights are updated by applying a gradient descent algorithm.
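The two phases can be sketched as a single training step in TensorFlow, the framework named above. Here `model`, `x` and `y` are assumed placeholders for the CNN, a batch of feature maps and the one-hot speaker labels, so this is a schematic under those assumptions rather than the authors' exact code.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

@tf.function
def train_step(model, x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)       # forward propagation
        loss = loss_fn(y, y_pred)              # error vs. the ideal output
    grads = tape.gradient(loss, model.trainable_variables)              # backward propagation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))    # weight update
    return loss
```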
Optimization For the optimization of the convolutional neural network, the training data set was fed in batches into the network in order to reduce the loss function. Stochastic gradient descent (SGD) and Adam were chosen as optimizers so as to reduce the load on the machine and allow faster convergence. To reduce overfitting, L2 regularization was implemented because of the uniqueness of its solution. Besides regularization, dropout was also used to reduce overfitting, which means that some nodes in the fully connected layers are discarded with a given probability.
Regularization Regularization is an effective way to avoid overfitting and to ensure generalization ability by explicitly controlling model complexity in machine learning. Regularization introduces a model-complexity term into the loss function and weakens the influence of noise in the training data through the weight values W. The regularization used in this invention is L2 regularization, which adds the sum of squares of the weight parameters to the original loss function. The calculation is shown in formula (3), where E is the training-sample error without the regularization term and λ is the regularization coefficient; the regularization is applied only to the fully connected layers.
L = E + λ Σ_j w_j^2    (3)
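As a hedged example, formula (3) maps onto the per-layer L2 penalty of tf.keras, applied here to a fully connected layer only, as stated above; the coefficient value 1e-4 is illustrative.

```python
import tensorflow as tf

# L2 penalty λ Σ w² added to the loss for this layer's kernel weights.
l2 = tf.keras.regularizers.l2(1e-4)
dense = tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=l2)
```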
Dropout Dropout is a network regularization method currently used by almost all deep convolution neural networks with fully connected layers. The idea behind dropout is that, for each neuron in a certain layer, the weight of the neuron is randomly reset to 0 with probability p in the training phase. All neurons are activated in the test phase, but their weights need to be multiplied by (1-p) to ensure that all weights have the same expectation in the training and testing stages. In this invention, dropout is applied for both input layer and hidden layer in the system.
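A minimal NumPy sketch of dropout exactly as described here: masking with probability p during training and scaling by (1 - p) at test time. Note that tf.keras.layers.Dropout instead uses inverted dropout, scaling by 1/(1 - p) during training, so this sketch mirrors the description rather than that implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Training phase: zero each activation with probability p."""
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test(a, p):
    """Test phase: keep all units but scale by (1 - p) so the expected
    activation matches the training phase."""
    return a * (1.0 - p)
```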
Adam The Adam optimization algorithm is an extension of the stochastic gradient descent method (SGD). What makes it different from SGD is that it combines the advantages of two extensions of SGD, namely the Adaptive Gradient algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). AdaGrad maintains a per-parameter learning rate, which improves performance on problems with sparse gradients. RMSProp also maintains per-parameter learning rates, adjusted according to the average of the most recent weight gradients (i.e., their rate of change), which means the algorithm performs well on online and non-stationary problems. Therefore, Adam is also used for optimization in our invention.
Testing After the training phase and optimization, the invention is able to recognize the identity of the speaker and is ready for testing. As a result, this invention can quickly recognize the speaker from a voice clip at least seven seconds long.
DESCRIPTION OF DRAWINGS Figure 1 is the flow block diagram of data preprocessing. Figure 2 is the flow block diagram of model training. Figure 3 is the flow block diagram of the trained model making predictions. Figure 4 is the architecture of the convolutional neural network in our invention. Figure 5 indicates the general structure of the deep neural network.
DESCRIPTION OF PREFERRED EMBODIMENT
Design The architecture of the convolutional neural network is shown in Figure 4. It has three convolutional layers, each followed by a max pooling layer, and uses a flatten layer and two fully connected layers to form the entire neural network. Our network architecture is inspired by the strengths of convolutional neural networks in processing multidimensional data. We collected about 180 minutes of audio data from two different individuals and cut them into 11-second segments for data storage. Figure 1 displays the procedure of how we preprocess the data, which applies to all training and validation data. Our network has three convolutional layers followed by two fully connected layers. Each convolutional layer consists of three components: convolution, batch normalization, and pooling. The convolution itself uses a 3x3 kernel (whose entries are learned during training) that filters across the entire audio data. The stride is set to one and zero padding is added to preserve the peripheral information. Right after the convolution, there is a pooling operation to down-sample the previous output, followed by a ReLU activation. These three components are grouped together in this order and repeated three times before the flatten layer, which transforms the 2D array into a 1D vector. This vector is then fed into two consecutive fully connected layers to produce the final prediction.
Procedure The procedure of this invention can be divided into the following steps.
Step 1: Data collection In this invention, all the training data (180 minutes) are voice records only, and Gaussian noise is added to some of the data. For all the voice records that were collected, a separation was performed to obtain audio segments of identical length. In total, each speaker has approximately 445 audio segments for training and over 250 voice clips for testing, and each segment is seven or eleven seconds long.
Step 2: Data preprocessing
1) Converting all the audio segments to mono.
2) Cutting all the audio into eleven-second segments.
3) Extracting the MFCC features of each audio segment into a numpy array.
4) Transforming the numpy array into the '.mat' format for further processing.
5) Transforming the shape of the data: the data of each audio segment is reshaped into a 32x32x26 matrix for the later calculations in the convolutional neural network.
6) One-hot encoding: each speaker's name is encoded as a distinct label array so that it can be used to classify the speaker.
Step 3: Training and optimization We use a CNN model which contains three convolutional layers and two fully connected layers:
Layer (type)                    Output Shape          Param #
conv2d (Conv2D)                 (None, 30, 30, 32)    7520
max_pooling2d (MaxPooling2D)    (None, 15, 15, 32)    0
conv2d_1 (Conv2D)               (None, 13, 13, 32)    9248
max_pooling2d_1 (MaxPooling2D)  (None, 6, 6, 32)      0
conv2d_2 (Conv2D)               (None, 4, 4, 128)     36992
max_pooling2d_2 (MaxPooling2D)  (None, 2, 2, 128)     0
flatten (Flatten)               (None, 512)           0
dense (Dense)                   (None, 256)           131328
dense_1 (Dense)                 (None, 64)            16448
dense_2 (Dense)                 (None, 2)             130
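The table can be reproduced with the following tf.keras sketch. The ReLU activations, the softmax output and the omission of batch-normalization layers are assumptions made so that the parameter counts match the table (the table lists no batch-normalization parameters even though the Design paragraph mentions batch normalization).

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 26)),                  # reshaped MFCC features
    tf.keras.layers.Conv2D(32, 3, activation="relu"),    # -> (30, 30, 32), 7520 params
    tf.keras.layers.MaxPooling2D(2),                     # -> (15, 15, 32)
    tf.keras.layers.Conv2D(32, 3, activation="relu"),    # -> (13, 13, 32), 9248 params
    tf.keras.layers.MaxPooling2D(2),                     # -> (6, 6, 32)
    tf.keras.layers.Conv2D(128, 3, activation="relu"),   # -> (4, 4, 128), 36992 params
    tf.keras.layers.MaxPooling2D(2),                     # -> (2, 2, 128)
    tf.keras.layers.Flatten(),                           # -> 512
    tf.keras.layers.Dense(256, activation="relu"),       # 131328 params
    tf.keras.layers.Dense(64, activation="relu"),        # 16448 params
    tf.keras.layers.Dense(2, activation="softmax"),      # 130 params (two speakers)
])
model.summary()
```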
Step 4: Testing
We take more than 500 unlabeled audio clips as input, 7-second segments for multichannel and 11-second segments for mono, and evaluate all nodes in the CNN to output the final prediction. In the first attempt, our recognition results are as follows:
Test Data Type                Result (Recognition Rate)
Multichannel, No-noise        100%
Mono, No-noise                100%
Multichannel, Gauss-noise     100%
Mono, Gauss-noise             70%
Overall                       92.6%
Then we adjusted some parameters, and the final key parameters were:
Optimizer: AdamOptimizer
Learning rate: 0.001
Kernel initializer: truncated_normal_initializer(stddev=0.01)
Epochs: 6
Batch size: 10
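A hedged mapping of these key parameters onto current tf.keras APIs; the correspondence between the TF1-style names (AdamOptimizer, truncated_normal_initializer) and the classes below is an assumption, and the commented usage is illustrative only.

```python
import tensorflow as tf

kernel_init = tf.keras.initializers.TruncatedNormal(stddev=0.01)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)

# Illustrative usage with the model sketched earlier:
# layer = tf.keras.layers.Conv2D(32, 3, kernel_initializer=kernel_init)
# model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=6, batch_size=10)
```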
And the adjusted training results are as follows:
Test Data Type                Result (Recognition Rate)
Multichannel, No-noise        100%
Mono, No-noise                100%
Multichannel, Gauss-noise     100%
Mono, Gauss-noise             100%
Overall                       100%

Claims (2)

CLAIM
1. A speaker identification method based on deep learning, characterized in that it is a method for speech recognition based on deep learning which includes Voice Activity Detection for detecting effective audio segments, Mel-frequency cepstral coefficients for feature extraction, and a Convolutional Neural Network for identifying the identity of the speaker.
2. The method according to claim 1, wherein the method has a simple architecture that can be applied to speaker identification quickly; since the speech time of our samples is not very long, our invention can also be applied to simple and rapid speaker identification.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
AU2020102038A 2020-08-28 2020-08-28 A speaker identification method based on deep learning Ceased AU2020102038A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020102038A AU2020102038A4 (en) 2020-08-28 2020-08-28 A speaker identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020102038A AU2020102038A4 (en) 2020-08-28 2020-08-28 A speaker identification method based on deep learning

Publications (1)

Publication Number Publication Date
AU2020102038A4 true AU2020102038A4 (en) 2020-10-08

Family

ID=72663853

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020102038A Ceased AU2020102038A4 (en) 2020-08-28 2020-08-28 A speaker identification method based on deep learning

Country Status (1)

Country Link
AU (1) AU2020102038A4 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345461A (en) * 2021-04-26 2021-09-03 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN113380255A (en) * 2021-05-19 2021-09-10 浙江工业大学 Voiceprint recognition poisoning sample generation method based on transfer training
CN113380255B (en) * 2021-05-19 2022-12-20 浙江工业大学 Voiceprint recognition poisoning sample generation method based on transfer training
CN113627300A (en) * 2021-08-02 2021-11-09 中电福富信息科技有限公司 Face recognition and living body detection method based on deep learning
CN113611314A (en) * 2021-08-03 2021-11-05 成都理工大学 Speaker identification method and system
WO2023036016A1 (en) * 2021-09-07 2023-03-16 广西电网有限责任公司贺州供电局 Voiceprint recognition method and system applied to electric power operation
CN114495947A (en) * 2022-03-04 2022-05-13 蔚来汽车科技(安徽)有限公司 Method and apparatus for detecting voice activity
CN115171700A (en) * 2022-06-13 2022-10-11 武汉大学 Voiceprint recognition voice assistant method based on pulse neural network
CN115171700B (en) * 2022-06-13 2024-04-26 武汉大学 Voiceprint recognition voice assistant method based on impulse neural network

Similar Documents

Publication Publication Date Title
AU2020102038A4 (en) A speaker identification method based on deep learning
Jahangir et al. Text-independent speaker identification through feature fusion and deep neural network
Liu et al. GMM and CNN hybrid method for short utterance speaker recognition
Lian et al. Speech emotion recognition via contrastive loss under siamese networks
Ding et al. Autospeech: Neural architecture search for speaker recognition
CN109949824B (en) City sound event classification method based on N-DenseNet and high-dimensional mfcc characteristics
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
Ohi et al. Deep speaker recognition: Process, progress, and challenges
CN112581979A (en) Speech emotion recognition method based on spectrogram
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN113312989B (en) Finger vein feature extraction network based on aggregated descriptors and attention
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
CN108364662A (en) Based on the pairs of speech-emotion recognition method and system for differentiating task
CN112329819A (en) Underwater target identification method based on multi-network fusion
CN111986699A (en) Sound event detection method based on full convolution network
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN117746908A (en) Voice emotion recognition method based on time-frequency characteristic separation type transducer cross fusion architecture
CN117854473B (en) Zero sample speech synthesis method based on local association information
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
Chinmayi et al. Emotion Classification Using Deep Learning
JPH09507921A (en) Speech recognition system using neural network and method of using the same
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Flower et al. A novel concatenated 1D-CNN model for speech emotion recognition
CN117976006A (en) Audio processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry