AU2020102038A4 - A speaker identification method based on deep learning - Google Patents

A speaker identification method based on deep learning

Info

Publication number
AU2020102038A4
Authority
AU
Australia
Prior art keywords
data
training
speaker
deep learning
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020102038A
Inventor
Yichen Jia
Jinyi Liu
Yanan WU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liu Jinyi Miss
Wu Yanan Miss
Original Assignee
Liu Jinyi Miss
Wu Yanan Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liu Jinyi Miss, Wu Yanan Miss filed Critical Liu Jinyi Miss
Priority to AU2020102038A priority Critical patent/AU2020102038A4/en
Application granted granted Critical
Publication of AU2020102038A4 publication Critical patent/AU2020102038A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)

Abstract

This invention lies in the field of digital audio processing and provides a speech recognition system for identifying different speakers based on deep learning. The invention consists of the following steps. First, sufficient data was prepared and split into training and testing sets. Secondly, we preprocessed the data using Voice Activity Detection (VAD) to detect the effective audio segments and Mel-frequency cepstral coefficients (MFCC) for feature extraction. Then, the training data were fed in batches into the Convolutional Neural Network (CNN) that we had designed. Simultaneously, the parameters of the CNN, namely the dropout rate, base learning rate and loss rate, were adjusted in order to optimize the performance of the model. Eventually, the optimal CNN can be applied to the testing data, and speaker identities can be recognized with an accuracy of 92.6%. In brief, with this invention the identity of the speaker can be recognized automatically without human involvement.

Description

Figure 1: Raw Data → Mono conversion → Voice Activity Detection → Audio Segment → transform to .mat format → Training Data
TITLE A speaker identification method based on deep learning
FIELD OF THE INVENTION This invention lies in the field of digital audio processing and serves to recognize speakers' identities using deep learning.
BACKGROUND OF THE INVENTION Human speech is produced by a complex biophysical process of interactions between the human speech center and the vocal organs. Vocal organs such as the tongue, teeth, throat, lungs and nose vary in size and shape from person to person, so the sonic map of any two individuals will show some variation. Like other biometric authentication technologies, speech recognition has the advantages of being unforgettable, requiring no memorization and being easy to use. In the field of biometric authentication technology, speech recognition has received widespread attention for its unique convenience, economy and accuracy, and it is becoming an important and popular security authentication method in people's daily life and work. In the past two decades, speech recognition technology has made significant progress and is beginning to move from the lab to the marketplace. More specifically, Siri, developed by Apple, and Alexa, developed by Amazon, are becoming prevalent in people's daily lives, and both have implemented speaker recognition so that they can recognize the identity of the speaker.
In terms of conventional automatic speech recognition methods, effective and rapid adaptation to specific accents and corruptions is still an issue that needs to be solved. To address this problem, deep learning is utilized for automatic speech recognition. Overfitting to the training data, however, then becomes a problem to be solved. Furthermore, naively trained Deep Neural Networks are usually treated as "black boxes" and are difficult to interpret directly because of their highly distributed representations.
From a general view, speaker recognition can be divided into three categories according to the application: speaker identification, speaker verification, and speaker diarization. The three phases of speaker verification can be briefly described as the training phase, the enrollment phase, and the evaluation phase. In addition, the type of speaker verification system depends on the type of data used for enrollment and recognition, namely text-dependent and text-independent modes of operation. From a technical perspective, the characteristics or features in the audio that differ among speakers, and that are used for speaker authentication, surveillance, and forensics, can be analyzed by speech recognition. Specifically, the process of checking the authenticity of a speaker can be described as a comparison between the speaker's speech and the template speech patterns of the many speakers already enrolled in the system.
The general technique for Voice Activity Detection (VAD) is dual-threshold endpoint detection based on short-time energy and zero-crossing rate. By applying this technique, the effective speech segments can be recognized and detected. Mel-Frequency Cepstral Coefficients (MFCC) are a widely used feature in automatic speech recognition, developed by Davis and Mermelstein in 1980. The process of extracting MFCC features can be summarized in seven phases: pre-emphasis, framing and windowing, fast Fourier transform, taking the absolute or squared value, Mel filtering, taking the logarithm, and finally Delta MFCC. The Deep Neural Network (DNN) was first introduced in the 1960s, but the first practical application, LeNet for recognizing handwritten numbers, did not appear until the late 1980s. The breakthrough in DNN-based speech recognition, however, was not made until 2011.
The great success of deep learning around 2010 is mainly due to three factors: the vast amount of data required to train the networks, the availability of sufficient computing resources, and the evolution of algorithms, which have greatly improved the accuracy and broadened the scope of applications of DNNs. Technically, DNNs introduce multiple layers of non-linear processing units, which allow complex data to be modelled well, while representations in a multi-layer configuration can be yielded automatically from the raw data representation. Therefore, signal spectrograms can be used directly as raw features in a general-purpose learning algorithm. In this invention, we concentrate on recognizing the identity of the speaker by using a text-independent speech recognition system based on deep learning. To solve the previously mentioned issues in traditional speech recognition systems, we plan to optimize the architecture through regularization, dropout and the number of iterations. Additionally, TensorFlow will be used in this invention as the deep learning framework for implementing the model, since it is an end-to-end open source machine learning platform with a comprehensive and flexible ecosystem of tools, libraries and community resources.
SUMMARY OF THE INVENTION In order to implement automatic audio feature extraction and solve the shortcomings and deficiencies of the existing technology, this invention proposes a speech recognition method for quickly identifying the identity of the speaker based on deep learning. From an overall perspective, the procedure of developing our system can be divided into five parts: data processing, architecture design, architecture training, optimization and testing. Our deep learning speech recognition invention also consists of several components, which are the speech segment database, convolutional neural networks, parameter optimization and the implementation of speech recognition.
Data processing The speech segment database was established by collecting our voice records. Then, we preprocessed all the audio segments by converting them to mono and splitting them into segments of identical length, about seven or eleven seconds long. After this separation, the audio segments are divided into a training data set and a testing data set. Meanwhile, Voice Activity Detection (VAD) is applied to all the segments to detect the valid fragments, or more specifically, to check the start point and the end point of each segment. In this way, the fragments that contain voice are detected and split out for feature extraction, which is done by applying Mel-frequency cepstral coefficients (MFCC). Eventually, the convolutional neural network is implemented according to our design, with the extracted features as its input.
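As an illustration only, the following Python sketch shows how the preprocessing flow of Figure 1 (mono conversion, a simple energy-based VAD standing in for the dual-threshold method, MFCC extraction and export to .mat) could be implemented. The librosa and scipy calls, the 16 kHz sample rate and the 26 MFCC coefficients are assumptions for the example, not the exact code of the invention.

```python
import numpy as np
import librosa
from scipy.io import savemat

def simple_vad(y, frame_len=400, hop=160, energy_ratio=0.1):
    """Crude energy-based VAD: keep frames whose short-time energy exceeds
    a fraction of the mean energy (a stand-in for the dual-threshold method)."""
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    energy = (frames ** 2).sum(axis=0)
    mask = energy > energy_ratio * energy.mean()
    keep = np.repeat(mask, hop)[: len(y)]
    return y[: len(keep)][keep]

def preprocess(path, sr=16000, n_mfcc=26):
    y, _ = librosa.load(path, sr=sr, mono=True)              # mono conversion
    y = simple_vad(y)                                         # voice activity detection
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, n_frames)
    return mfcc.T                                             # frames x coefficients

# Example usage (hypothetical file names):
# savemat("segment_001.mat", {"mfcc": preprocess("segment_001.wav")})
```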
Architecture design Deep learning is applied in this speech recognition system; therefore, the architecture of the system is designed with three convolutional layers followed by two fully connected layers, as shown in Figure 4. To take full advantage of automatic feature extraction in deep learning, a layer-by-layer initializing training mode was adopted. As a result, training efficiency can be improved and technical difficulties, for example overfitting, can be overcome during the training phase. The difficulties of audio feature extraction and automatic audio recognition are solved by applying this method.
Convolutional neural network (CNN) A basic Convolutional Neural Network (CNN) consists of three structures: convolution, activation and pooling. The CNN of this invention is a sequence of layers, specifically three convolutional layers. A convolutional layer consists of several convolutional units; the first convolutional layer may only be able to extract some low-level features, so that the following layers can iteratively extract more complex features from those low-level features. The most common operation in a CNN is convolution, which uses a convolutional kernel to extract features from the input image. A convolutional kernel is a matrix that slides over the image in a sliding-window fashion and is multiplied with the input to enhance the output as expected. The output of the system at a certain time is the result of the interaction of multiple inputs, and the size of the output data can be calculated with formula (1) shown below, where W is the size of the input unit, F is the receptive field, S is the stride, and P is the zero-padding.
Output size = (W - F + 2P)/S + 1    (1)
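For illustration, a small Python helper applying formula (1); the example values in the comment correspond to the output shapes listed later in the layer table.

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size of a convolution, per formula (1)."""
    return (W - F + 2 * P) // S + 1

# A 32-wide input with a 3x3 kernel, stride 1 and no padding gives 30,
# matching the (None, 30, 30, 32) shape listed later in the layer table.
print(conv_output_size(32, 3, S=1, P=0))  # 30
```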
Fully connected neural network Besides the convolutional layers, the fully connected layers play the role of a "classifier" in the entire neural network. Unlike in the convolutional layers, each node in a fully connected layer is connected to all the nodes of the previous layer; it is used to synthesize the features extracted previously and to map the input image to the label set, that is, to classify it.
Pooling Pooling is a subsampling operation whose primary goal is to reduce the feature space of the feature map or, in other words, to reduce its resolution. Since the feature map contains a large number of parameters, retaining every detail is not beneficial for extracting high-level features. The pooling layers in our invention are sandwiched between successive convolutional layers and are used to compress the amount of data and parameters and to reduce overfitting.
Activation function After convolution, a bias is usually added and a non-linear activation function is introduced, where the bias is denoted b and the activation function is h(·). The result after activation is given in formula (2).
z = h(Σ_i W_i v_i + b)    (2)
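A toy numerical illustration of formula (2), with a ReLU chosen as the activation h (the choice of ReLU and the numbers are assumptions made for the example):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

W = np.array([[0.2, -0.5, 0.1],
              [0.4,  0.3, -0.2]])   # 2 output units, 3 inputs
v = np.array([1.0, 2.0, -1.0])      # input features
b = np.array([0.05, -0.1])          # bias per output unit

z = relu(W @ v + b)                  # formula (2): z = h(W v + b)
print(z)                             # approximately [0.  1.1]
```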
Architecture training A convolutional neural network is essentially an input-output mapping. It can learn a large number of mapping relationships between inputs and outputs without requiring any exact mathematical expressions between them. As long as the convolutional neural network is trained on a specific pattern, the network acquires the ability to map between other inputs and outputs. The training of our convolutional neural network can be summarized in two phases: a forward propagation phase and a backward propagation phase. In the first phase, a sample is taken from the sample set and fed into the network, and the corresponding actual output is calculated; in this phase, data is transferred from the input layer to the output layer through step-by-step transformations. In the second phase, backward propagation, the error between the actual output and the corresponding ideal output is calculated, and then the gradient of each weight is computed. Eventually, the weights are updated by applying a gradient descent algorithm.
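The two phases can be sketched as a single training step in TensorFlow, the framework named above. Here `model`, `x` and `y` are assumed placeholders for the CNN, a batch of feature maps and the one-hot speaker labels, so this is a schematic under those assumptions rather than the authors' exact code.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

@tf.function
def train_step(model, x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)       # forward propagation
        loss = loss_fn(y, y_pred)              # error vs. the ideal output
    grads = tape.gradient(loss, model.trainable_variables)              # backward propagation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))    # weight update
    return loss
```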
Optimization For the optimization of the convolutional neural network, the training data set was fed in batches into the network in order to reduce the loss function. Stochastic gradient descent (SGD) and Adam were chosen as optimizers so as to reduce the load on the machine and allow faster convergence. To reduce overfitting, L2 regularization was implemented because of the uniqueness of its solution. Besides regularization, dropout was also used to reduce overfitting, which means that some nodes in the fully connected layers are discarded with a given probability.
Regularization Regularization is an effective way to avoid overfitting and to ensure generalization ability by explicitly controlling model complexity in machine learning. Regularization introduces a model-complexity term into the loss function and weakens the influence of noise in the training data through the weight values W. The regularization used in this invention is L2 regularization, which adds the sum of squares of the weight parameters to the original loss function. The calculation is shown in formula (3), where E is the training-sample error without the regularization term and λ is the regularization coefficient; the regularization is applied only to the fully connected layers.
L = E + λ Σ_j w_j^2    (3)
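As a hedged example, formula (3) maps onto the per-layer L2 penalty of tf.keras, applied here to a fully connected layer only, as stated above; the coefficient value 1e-4 is illustrative.

```python
import tensorflow as tf

# L2 penalty λ Σ w² added to the loss for this layer's kernel weights.
l2 = tf.keras.regularizers.l2(1e-4)
dense = tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=l2)
```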
Dropout Dropout is a network regularization method currently used by almost all deep convolution neural networks with fully connected layers. The idea behind dropout is that, for each neuron in a certain layer, the weight of the neuron is randomly reset to 0 with probability p in the training phase. All neurons are activated in the test phase, but their weights need to be multiplied by (1-p) to ensure that all weights have the same expectation in the training and testing stages. In this invention, dropout is applied for both input layer and hidden layer in the system.
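A minimal NumPy sketch of dropout exactly as described here: masking with probability p during training and scaling by (1 - p) at test time. Note that tf.keras.layers.Dropout instead uses inverted dropout, scaling by 1/(1 - p) during training, so this sketch mirrors the description rather than that implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Training phase: zero each activation with probability p."""
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test(a, p):
    """Test phase: keep all units but scale by (1 - p) so the expected
    activation matches the training phase."""
    return a * (1.0 - p)
```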
Adam The Adam optimization algorithm is an extension of the stochastic gradient descent method (SGD). What makes it different from SGD is that it combines the advantages of two extensions of SGD, namely the Adaptive Gradient algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). AdaGrad maintains a per-parameter learning rate, which improves performance on problems with sparse gradients. RMSProp also maintains per-parameter learning rates, adjusted according to the average of the most recent weight gradients (i.e., their rate of change), which means the algorithm performs well on online and non-stationary problems. Therefore, Adam is also used for optimization in our invention.
Testing After the training phase and optimization, the invention is able to recognize the identity of the speaker and is ready for testing. As a result, this invention can quickly recognize the speaker from a voice clip at least seven seconds long.
DESCRIPTION OF DRAWINGS Figure 1 is the flow block diagram of data preprocessing. Figure 2 is the flow block diagram of model training. Figure 3 is the flow block diagram of the trained model making predictions. Figure 4 is the architecture of the convolutional neural network in our invention. Figure 5 indicates the general structure of the deep neural network.
DESCRIPTION OF PREFERRED EMBODIMENT
Design The architecture of the convolutional neural network is shown in Figure 4. It has three convolutional layers, each followed by a max pooling layer, and uses a flatten layer and two fully connected layers to form the entire neural network. Our network architecture is inspired by the strengths of convolutional neural networks in processing multidimensional data. We collected about 180 minutes of audio data from two different individuals and cut them into 11-second segments for data storage. Figure 1 displays the procedure of how we preprocess the data, which applies to all training and validation data. Our network has three convolutional layers followed by two fully connected layers. Each convolutional layer consists of three components: convolution, batch normalization, and pooling. The convolution itself uses a 3x3 kernel (whose entries are learned during training) that filters across the entire audio data. The stride is set to one and zero padding is added to preserve the peripheral information. Right after the convolution, there is a pooling operation to down-sample the previous output, followed by a ReLU activation. These three components are grouped together in this order and repeated three times before the flatten layer, which transforms the 2D array into a 1D vector. This vector is then fed into two consecutive fully connected layers to produce the final prediction.
Procedure The procedure of this invention can be divided into the following steps.
Step 1: Data collection In this invention, all the training data (180 minutes) are voice records only, and Gaussian noise is added to some of the data. For all the voice records that were collected, a separation was performed to obtain audio segments of identical length. In total, each speaker has approximately 445 audio segments for training and over 250 voice clips for testing, and each segment is seven or eleven seconds long.
Step 2: Data preprocessing
1) Converting all the audio segments to mono.
2) Cutting all the audio into eleven-second segments.
3) Extracting the MFCC features of each audio segment into a numpy array.
4) Transforming the numpy array into the '.mat' format for further processing.
5) Transforming the shape of the data: the data of each audio segment is reshaped into a 32x32x26 matrix for the later calculations in the convolutional neural network.
6) One-hot encoding: each speaker's name is encoded as a distinct label array so that it can be used to classify the speaker.
Step 3: Training and optimization We use a CNN model which contains three convolutional layers and two fully connected layers:
Layer (type)                    Output Shape          Param #
conv2d (Conv2D)                 (None, 30, 30, 32)    7520
max_pooling2d (MaxPooling2D)    (None, 15, 15, 32)    0
conv2d_1 (Conv2D)               (None, 13, 13, 32)    9248
max_pooling2d_1 (MaxPooling2D)  (None, 6, 6, 32)      0
conv2d_2 (Conv2D)               (None, 4, 4, 128)     36992
max_pooling2d_2 (MaxPooling2D)  (None, 2, 2, 128)     0
flatten (Flatten)               (None, 512)           0
dense (Dense)                   (None, 256)           131328
dense_1 (Dense)                 (None, 64)            16448
dense_2 (Dense)                 (None, 2)             130
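The table can be reproduced with the following tf.keras sketch. The ReLU activations, the softmax output and the omission of batch-normalization layers are assumptions made so that the parameter counts match the table (the table lists no batch-normalization parameters even though the Design paragraph mentions batch normalization).

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 26)),                  # reshaped MFCC features
    tf.keras.layers.Conv2D(32, 3, activation="relu"),    # -> (30, 30, 32), 7520 params
    tf.keras.layers.MaxPooling2D(2),                     # -> (15, 15, 32)
    tf.keras.layers.Conv2D(32, 3, activation="relu"),    # -> (13, 13, 32), 9248 params
    tf.keras.layers.MaxPooling2D(2),                     # -> (6, 6, 32)
    tf.keras.layers.Conv2D(128, 3, activation="relu"),   # -> (4, 4, 128), 36992 params
    tf.keras.layers.MaxPooling2D(2),                     # -> (2, 2, 128)
    tf.keras.layers.Flatten(),                           # -> 512
    tf.keras.layers.Dense(256, activation="relu"),       # 131328 params
    tf.keras.layers.Dense(64, activation="relu"),        # 16448 params
    tf.keras.layers.Dense(2, activation="softmax"),      # 130 params (two speakers)
])
model.summary()
```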
Step 4: Testing
We take more than 500 unlabeled audio clips as input, 7-second segments for multichannel and 11-second segments for mono, and evaluate all nodes in the CNN to output the final prediction. In the first attempt, our recognition results are as follows:
Test Data Type                Result (Recognition Rate)
Multichannel, No-noise        100%
Mono, No-noise                100%
Multichannel, Gauss-noise     100%
Mono, Gauss-noise             70%
Overall                       92.6%
Then we adjusted some parameters, and the final key parameters were:
Optimizer: AdamOptimizer
Learning rate: 0.001
Kernel initializer: truncated_normal_initializer(stddev=0.01)
Epochs: 6
Batch size: 10
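A hedged mapping of these key parameters onto current tf.keras APIs; the correspondence between the TF1-style names (AdamOptimizer, truncated_normal_initializer) and the classes below is an assumption, and the commented usage is illustrative only.

```python
import tensorflow as tf

kernel_init = tf.keras.initializers.TruncatedNormal(stddev=0.01)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)

# Illustrative usage with the model sketched earlier:
# layer = tf.keras.layers.Conv2D(32, 3, kernel_initializer=kernel_init)
# model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=6, batch_size=10)
```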
And the adjusted training results are as follows:
Test Data Type                Result (Recognition Rate)
Multichannel, No-noise        100%
Mono, No-noise                100%
Multichannel, Gauss-noise     100%
Mono, Gauss-noise             100%
Overall                       100%

Claims (2)

CLAIM
1. A speaker identification method based on deep learning, characterized in that it is a method for speech recognition based on deep learning which includes Voice Activity Detection for detecting effective audio segments, Mel-frequency cepstral coefficients for feature extraction, and a Convolutional Neural Network for identifying the identity of the speaker.
2. The method according to claim 1, wherein the method has a simple architecture that can be applied to speaker identification quickly; since the speech time of our samples is not very long, our invention can also be applied to simple and rapid speaker identification.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
AU2020102038A 2020-08-28 2020-08-28 A speaker identification method based on deep learning Ceased AU2020102038A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020102038A AU2020102038A4 (en) 2020-08-28 2020-08-28 A speaker identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020102038A AU2020102038A4 (en) 2020-08-28 2020-08-28 A speaker identification method based on deep learning

Publications (1)

Publication Number Publication Date
AU2020102038A4 true AU2020102038A4 (en) 2020-10-08

Family

ID=72663853

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020102038A Ceased AU2020102038A4 (en) 2020-08-28 2020-08-28 A speaker identification method based on deep learning

Country Status (1)

Country Link
AU (1) AU2020102038A4 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345461A (en) * 2021-04-26 2021-09-03 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN113380255A (en) * 2021-05-19 2021-09-10 浙江工业大学 Voiceprint recognition poisoning sample generation method based on transfer training
CN113380255B (en) * 2021-05-19 2022-12-20 浙江工业大学 Voiceprint recognition poisoning sample generation method based on transfer training
CN113627300A (en) * 2021-08-02 2021-11-09 中电福富信息科技有限公司 Face recognition and living body detection method based on deep learning
CN113611314A (en) * 2021-08-03 2021-11-05 成都理工大学 Speaker identification method and system
WO2023036016A1 (en) * 2021-09-07 2023-03-16 广西电网有限责任公司贺州供电局 Voiceprint recognition method and system applied to electric power operation
CN114495947A (en) * 2022-03-04 2022-05-13 蔚来汽车科技(安徽)有限公司 Method and apparatus for detecting voice activity
CN115171700A (en) * 2022-06-13 2022-10-11 武汉大学 Voiceprint recognition voice assistant method based on pulse neural network
CN115171700B (en) * 2022-06-13 2024-04-26 武汉大学 Voiceprint recognition voice assistant method based on impulse neural network

Similar Documents

Publication Publication Date Title
AU2020102038A4 (en) A speaker identification method based on deep learning
Jahangir et al. Text-independent speaker identification through feature fusion and deep neural network
Liu et al. GMM and CNN hybrid method for short utterance speaker recognition
Lian et al. Speech emotion recognition via contrastive loss under siamese networks
Ding et al. Autospeech: Neural architecture search for speaker recognition
CN109949824B (en) City sound event classification method based on N-DenseNet and high-dimensional mfcc characteristics
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
Ohi et al. Deep speaker recognition: Process, progress, and challenges
CN112581979A (en) Speech emotion recognition method based on spectrogram
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN113312989B (en) Finger vein feature extraction network based on aggregated descriptors and attention
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
CN108364662A (en) Based on the pairs of speech-emotion recognition method and system for differentiating task
CN112329819A (en) Underwater target identification method based on multi-network fusion
CN111986699A (en) Sound event detection method based on full convolution network
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN117746908A (en) Voice emotion recognition method based on time-frequency characteristic separation type transducer cross fusion architecture
CN117854473B (en) Zero sample speech synthesis method based on local association information
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
Chinmayi et al. Emotion Classification Using Deep Learning
JPH09507921A (en) Speech recognition system using neural network and method of using the same
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Flower et al. A novel concatenated 1D-CNN model for speech emotion recognition
CN117976006A (en) Audio processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry