AU2019101222A4 - A Speaker Recognition System Based on Deep Learning - Google Patents

A Speaker Recognition System Based on Deep Learning

Info

Publication number
AU2019101222A4
AU2019101222A4
Authority
AU
Australia
Prior art keywords
data
layer
rate
layers
convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101222A
Inventor
Yuyao Feng
Haowei LI
Haotian WANG
Zihan Yi
Sijie ZHANG
Chao Zhu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhang Sijie Miss
Original Assignee
Zhang Sijie Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhang Sijie Miss filed Critical Zhang Sijie Miss
Priority to AU2019101222A priority Critical patent/AU2019101222A4/en
Application granted granted Critical
Publication of AU2019101222A4 publication Critical patent/AU2019101222A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

This invention lies in the field of digital signal processing. It is a speaker recognition system that identifies different speakers based on deep learning. The invention consists of the following steps. Firstly, we collect voice data from different people. Secondly, the collected data is preprocessed by extracting its Mel Frequency Cepstral Coefficients (MFCC) and randomly divided into a training set and a test set. Thirdly, we cut the training set into batches and feed them into a convolutional neural network which consists of convolutional layers, max pooling layers and fully connected layers. After repeatedly adjusting the parameters of the network, such as the learning rate, dropout rate and decay rate, the model reaches its optimal performance. Finally, the test set is also cut into batches and fed into the trained neural network. The final recognition accuracy rate is 70.23%. In brief, this invention can automatically and efficiently recognize different speakers.

Description

A Speaker Recognition System Based on Deep Learning
FIELD OF THE INVENTION
This invention is in the field of digital signal processing and identifies different speakers by deep learning.
BACKGROUND
Speech carries important biometric information and is the most effective way of human communication. With the rapid development of information technology, computers are widely used in various aspects of human social life, and speech recognition can improve work efficiency in many situations such as telephone communication, mobile payments, machine control and airport security. Making the computer recognize the identity of the speaker during human-computer interaction is therefore of great significance to public safety, information security and property security.
In the 1950s, a speech recognition system that recognized the pronunciation of 10 English digits was built at Bell Labs, marking the beginning of modern speech recognition research. With the wide application of computer technology in the field of speech recognition, a series of important research results have been obtained. The concept of Mel Frequency Cepstral Coefficients (MFCC) was proposed in the 1980s and is still an important feature parameter in speech recognition. The introduction of the artificial neural network (ANN) has injected new vitality into speech recognition technology and promoted its development.
Deep learning is a branch of machine learning research which can be understood as the development of ANN. It is essentially a method of training deep structural models and it is also an algorithm for modeling complex relationships between data through multiple layers. With the development of high-performance computing platform and big data processing technique, deep learning has become a powerful method for speech recognition. In many popular models, such as VGG, more convolutional layers and pooling layers are needed, and the processing of data is more complicated.
In this invention, TensorFlow is used to implement the deep learning framework. We first collect voice clips of different people in the environment of daily life and use MFCC to preprocess the data. After that, we randomly feed the training data set into a simpler convolutional neural network in batches. Parameters including learning rate, dropout rate and decay rate are optimized by observing the model's average accuracy and variance of the testing set until the model achieves the best performance. In addition, the model can also be used in fields such as image recognition.
SUMMARY
The aim of this project is to recognize the identity of the speaker. The technical solution of this invention is implemented as follows:
We collect voice data from different people. All data is cut into a number of segments. In order to get sufficient data, each segment is cut into smaller parts, with adjacent parts overlapping slightly. Next, we extract Mel Frequency Cepstral Coefficients (MFCC) for all segments. The output of MFCC is reshaped and appended with labels. Then all data is saved as MATLAB (.mat) files and divided into a training set and a testing set.
Before training, the .mat files are loaded into memory to do some preprocessing work, such as one-hot encoding and normalization. Both methods allow the computer to process the characteristic values more efficiently.
Then we design our convolutional network framework. Our convolutional network consists of several layers. In total there are four convolutional layers, two max pooling layers and two fully connected layers. Figure 1 shows the order of these layers.
The function of the convolutional layers is to extract features from the input. After the convolutional layers, an output with a large dimension is obtained.
The function of the max pooling layer is to decrease the dimension of the data. It cuts the feature map into several regions, takes the maximum value in each region and combines them into a new feature with a smaller dimension.
We use the Rectified Linear Unit (ReLU) in the convolutional layers and in fully connected layer 1. Using the ReLU unit can effectively alleviate the issues of vanishing gradients and overfitting.
A fully connected layer connects every node in one layer to every node in the next layer. The input matrix goes through the fully connected layer, and the activations of its nodes can thus be computed. Eventually, the classification output is calculated.
Softmax is a function that can output the probability that each classification is taken. It maps the outputs of multiple neurons to the interval (0, 1).
For parameter optimization, we utilize L2 regularization to calculate the tensor loss value and improve the recognition. Then we use dropout to drop some nodes randomly so that the overfitting phenomenon is reduced. We considered three methods to update the loss and find the minimum: the steepest descent method, Momentum, and the Adam optimization algorithm (Adam). Compared to the other two methods, Adam is more suitable for our project. Therefore, we choose Adam for optimization.
To keep the learning rate moderate, we use a decay function to reduce the learning rate at a stable decay step. In the end, we optimize our framework so that the system can be operated more easily and concisely.
When all the settings for the data, network and optimization parameters are done, all the data is loaded from the .mat files, and then the testing and training data are divided into batches of an appropriate size. After being processed by the convolution and max pooling layers mentioned above, the shape of the data is converted into the corresponding form. At the same time, the data loss and average accuracy are calculated from the training and testing sessions respectively, and the parameters are modified in the program to get the best results.
Finally, we are able to recognize the speaker at an optimal accuracy rate.
DESCRIPTION OF DRAWINGS
Figure 1 shows the data flow of our convolutional neural network.
Figure 2 is the structure of the convolutional neural network.
Figure 3 shows the data flow of convolutional layer 1 and 2.
Figure 4 shows the data flow of max pooling layer 1.
Figure 5 shows the data flow of convolutional layer 3 and 4.
Figure 6 shows the data flow of max pooling layer 2.
Figure 7 is the structure of the fully connected layers.
Figure 8 shows the principle of dropout.
Figure 9 shows the procedure of our project.
DESCRIPTION OF PREFERRED EMBODIMENT
Data Preprocessing
In this project, we collect voice data from 5 people. All data is saved as WAV files and cut into 2000 segments of 7 seconds each. The total amount of recording for each person is more than 45 minutes. In order to get sufficient data, each 7-second fragment is cut into 21 two-second parts, with adjacent parts overlapping slightly. We thereby obtain 8400 recording files for each person. All files are renamed in the form X_Y, where X identifies the speaker and Y is the serial number. This step helps us load the data more efficiently.
Next, we extract Mel Frequency Cepstral Coefficients (MFCC) for all 42000 segments. There are mainly three steps to obtain MFCC from the WAV files.
(1) The spectrum of the signal is extracted by the short-time Fourier transform.
(2) The energy spectrum is obtained by squaring the spectrum, and the effective fragments are extracted by bandpass filtering.
(3) The logarithm of the filter outputs is taken and the inverse Fourier transform is applied to get the MFCC. A simplified form of the formula is as follows:
C_n = Σ_{m=1}^{M} log X(m) · cos[π n (m - 0.5) / M],  n = 1, 2, ..., L   (1)

where L is the number of cepstral coefficients and M is the number of triangular filters.
The final output of MFCC is a matrix of shape [Zx16], where Z is related to the bit rate of the source and its value should be more than 2000. However, we only need the first [1024x16] characteristic values to get a matrix of [32x32x16] for each fragment, so the surplus parts are cut off. We append a label corresponding to each person who recorded their voice. Finally, 80% of the whole data is randomly chosen as the training set while the remaining data composes the testing set. Both sets are saved as MATLAB (.mat) files for further use in the following steps.
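As a concrete illustration of this preprocessing pipeline, the sketch below extracts 16 MFCC coefficients per frame, keeps the first 1024 frames, reshapes each fragment into a [32x32x16] block, and saves the split sets as .mat files. The patent does not name a particular toolkit, so librosa and scipy.io are assumed here, and the file list, dictionary keys and frame settings are illustrative only.

```python
# A minimal sketch of the MFCC extraction and reshaping step described above.
# librosa/scipy.io, the file names and the .mat keys are assumptions.
import numpy as np
import librosa
from scipy.io import savemat

N_MFCC = 16        # 16 cepstral coefficients per frame, as described
N_FRAMES = 1024    # keep only the first 1024 frames (Z >= 1024 assumed)

def wav_to_feature(path):
    """Load a 2-second WAV fragment and return a [32, 32, 16] feature block."""
    signal, sr = librosa.load(path, sr=None)
    # MFCC matrix of shape [16, Z]; transpose to [Z, 16] as in the description.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC).T
    mfcc = mfcc[:N_FRAMES]              # cut off the surplus frames
    return mfcc.reshape(32, 32, 16)     # 1024 x 16 -> 32 x 32 x 16

# Example: build features and labels from a (hypothetical) file list,
# split 80/20 at random, and save both sets as MATLAB .mat files.
files = [("0_1.wav", 0), ("1_1.wav", 1)]   # (file name "X_Y", speaker label X)
feats = np.stack([wav_to_feature(f) for f, _ in files])
labels = np.array([lab for _, lab in files])

idx = np.random.permutation(len(files))
split = int(0.8 * len(files))
savemat("train.mat", {"x": feats[idx[:split]], "y": labels[idx[:split]]})
savemat("test.mat",  {"x": feats[idx[split:]], "y": labels[idx[split:]]})
```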
Before training, we load the .mat files into memory to do some preprocessing work.
(1) The sets are reshaped and the labels are transformed into one-hot encoding. For example, we transform the label of the first speaker into [1,0,0,0,0] and the label of the third speaker into [0,0,1,0,0].
(2) All characteristic values are normalized to the range between -1 and 1. The range of the characteristic values is unknown at the beginning. We obtain the minimum and maximum of the characteristic values, which are -112 and 85 respectively. So we normalize all characteristic values by:
N = (C + 13.5) / 98.5   (2)

where N is the normalized value and C is the original characteristic value.
Both methods allow the computer to process the characteristic values more efficiently.
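For clarity, a minimal sketch of these two preprocessing steps is given below. It assumes the .mat files store the features under the key "x" and the integer speaker labels under the key "y"; the key names are assumptions, and the constants follow equation (2).

```python
# One-hot encoding of the five speaker labels and normalization to [-1, 1]
# using the observed extremes -112 and 85; .mat keys "x"/"y" are assumptions.
import numpy as np
from scipy.io import loadmat

NUM_SPEAKERS = 5

train = loadmat("train.mat")
features, labels = train["x"], train["y"].ravel().astype(int)

# One-hot encoding: speaker 0 -> [1,0,0,0,0], speaker 2 -> [0,0,1,0,0], ...
one_hot = np.eye(NUM_SPEAKERS)[labels]

# Normalization by equation (2): maps [-112, 85] onto [-1, 1].
features = (features + 13.5) / 98.5
```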
Network Design
Figure 1 shows the structure of our neural network. The network has in total four convolutional layers, two max pooling layers and two fully connected layers. The input passes in turn through convolutional layer 1, convolutional layer 2, max pooling layer 1, convolutional layer 3, convolutional layer 4, max pooling layer 2, fully connected layer 1 and fully connected layer 2.
The function of the convolutional layers is to extract diverse features from the input. The first convolutional layer can only extract elementary features such as edges, lines and corners. More layers of the network can extract more complex features from these low-level features through iteration.
Some parameters related to the calculation of the convolutional layer need to be declared:
(1) Filter: the convolution kernel in convolutional neural network.
(2) Depth: it controls the depth of the output unit, which is equal to the number of filters.
(3) Stride: the step size of the convolution kernel during convolution.
(4) Receptive Field: the width (height) of the convolution kernel.
(5) Zero-padding: the value of zero-padding has two situations: “Valid” means there is no padding, while “Same” means the output image is the same size as the input image. In this program we use “Same”.
The width(height) of the output data matrix can be calculated by the formula below:
W' = (W - F + 2P)/S + 1   (3)

where W is the width (height) of the input, F is the width (height) of the convolution kernel, P is the value of zero-padding and S is the step size.
After the convolutional layers, an output with a large dimension is obtained. The function of the max pooling layer is to cut the matrix into several regions, take the maximum value in each region and combine them into a new feature with a smaller dimension.
A fully connected layer connects every node in one layer to every node in the next layer. The input matrix goes through the fully connected layer, and the activations of its nodes can thus be computed. Eventually, the classification output is calculated.
We use the Rectified Linear Unit (ReLU) in the convolutional layers and in fully connected layer 1. The function of ReLU is:

ReLU(x) = max(0, x) = { x, x > 0; 0, x ≤ 0 }   (4)

Using the ReLU unit can effectively alleviate the issues of vanishing gradients and overfitting.
The specific situation of each layer in this project is introduced below.
(1) Convolutional Layer 1
The input data of convolutional layer 1 has the shape [32x32x1]. It is convoluted by a [3x3x1] convolution kernel. The convolution kernel moves through the horizontal and vertical directions of the input data. Its stride is 1 and its zero-padding is 1. Therefore, the width (height) of the output data is

(32 + 2x1 - 3)/1 + 1 = 32

The number of filters is 16, namely the depth of the output data is 16. As a result, the shape of the output data is [32x32x16].
After convolution, we input the result into the ReLU unit. After this procedure, the size of the data is still [32x32x16].

(2) Convolutional Layer 2

The input data of convolutional layer 2 has the shape [32x32x16]. It is convoluted by a [3x3x16] convolution kernel. The convolution kernel's stride is 1 and its zero-padding is 1. Therefore, same as the calculation of convolutional layer 1, the width (height) of the output data is 32. The number of filters is also 16, namely the depth of the output data is 16. As a result, the shape of the output data is [32x32x16].

After convolution, we input the result into the ReLU unit. Still, after this procedure, the size of the data is [32x32x16].

(3) Max Pooling Layer 1
In this invention, we set max pooling layer 1 with a filter of size [2x2] and a stride of 2, which downsamples the input by 2 along both width and height. Each time the max operation is performed, a 2x2 region is taken and its maximum value substitutes for the region. The depth dimension remains the same. Therefore, in this project the initial input volume has the size [32x32x16], and it is pooled with a filter of size 2 and stride 2. An output of size [16x16x16] is finally produced.
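The three layers described so far can be sketched as follows in TensorFlow 1.x style (TensorFlow is the framework named in this invention); the variable names, the weight initialization and the omission of biases are simplifications for illustration, not the exact implementation.

```python
# Convolutional layers 1-2 and max pooling layer 1 as specified above:
# 3x3 kernels, 16 filters, stride 1, "SAME" zero-padding, 2x2 pooling, stride 2.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 32, 32, 1])   # input batch

w1 = tf.Variable(tf.truncated_normal([3, 3, 1, 16], stddev=0.1))
conv1 = tf.nn.relu(tf.nn.conv2d(x, w1, strides=[1, 1, 1, 1], padding='SAME'))
# conv1: [batch, 32, 32, 16]

w2 = tf.Variable(tf.truncated_normal([3, 3, 16, 16], stddev=0.1))
conv2 = tf.nn.relu(tf.nn.conv2d(conv1, w2, strides=[1, 1, 1, 1], padding='SAME'))
# conv2: [batch, 32, 32, 16]

pool1 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                       padding='SAME')
# pool1: [batch, 16, 16, 16]
```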
(4) Convolutional Layer 3
The input data of convolutional layer 3 has the shape [16x16x16]. It is convoluted by a [3x3x16] convolution kernel. The convolution kernel's stride is 1 and its zero-padding is 1. Therefore, the width (height) of the output data is

(16 + 2x1 - 3)/1 + 1 = 16

The number of filters is 16, namely the depth of the output data is 16. As a result, the shape of the output data is [16x16x16].
After convolution, we input the result into the ReLU unit. After this procedure, the size of the data is [16x16x16].

(5) Convolutional Layer 4

The input data of convolutional layer 4 has the shape [16x16x16]. It is convoluted by a [3x3x16] convolution kernel. The convolution kernel's stride is 1 and its zero-padding is 1. Therefore, same as the calculation of convolutional layer 3, the width (height) of the output data is 16. The number of filters is also 16, namely the depth of the output data is 16. As a result, the shape of the output data is [16x16x16].
After convolution, we input the result into the ReLU unit. The size of the output data is still [16x16x16].

(6) Max Pooling Layer 2

Same as max pooling layer 1, max pooling layer 2 has a filter of size [2x2] with a stride of 2. Each time the max operation is performed, it substitutes a 2x2 region with the maximum of its four values. The depth dimension remains the same. Therefore it reduces the input volume of size [16x16x16] with a filter of size 2 and stride 2, and the output size is [8x8x16].

(7) Fully Connected Layer 1
After finishing the convolutions, we rearrange our data. We reshape the data matrix from the shape [batch, 8, 8, 16] to [batch, 1024]. We take the batch size of the training set as 64 and that of the test set as 400. Therefore, for the training data the input matrix has the shape [64, 1024] and for the test data it is [400, 1024]. Accordingly, the input of fully connected layer 1 has 1024 nodes, each of which is connected to all nodes of the next layer.
Then we define the values of the weights of fully connected layer 1. We give each weight a value drawn from a normal distribution. The values of the weights are stored in a matrix with the shape [size x size x channels, number of hidden nodes], which in this program is [1024, 128]. We also define the corresponding bias. Its value is 0.1 with shape [number of hidden nodes], namely [128].

To calculate the output of fully connected layer 1, we use the formula:

y = w^T x + b   (5)

in which y is the output of fully connected layer 1, w is the weight matrix, x is the input matrix, and b is the bias. Here we choose ReLU as the activation function. From this procedure we get the output of fully connected layer 1. Its shape is [64, 128].

(8) Fully Connected Layer 2
We put the result above into fully connected layer 2 as its input. Similarly, we define the values of the weights of fully connected layer 2 according to a normal distribution. The values of the weights are stored in a matrix with the shape [number of hidden nodes, number of classes], which in this layer is [128, 5]. We also define the corresponding bias. Its value is 0.1 with shape [number of classes], namely [5].
In this layer, we use the Softmax function to calculate the output. Softmax is a function that can output the probability that each classification is taken. Its formula is:

Softmax(x_i) = exp(x_i) / Σ_{j=1}^{n} exp(x_j)   (6)

where x_i is the value of the i-th element. The output gives the probability of each label, namely the probability that the data is judged as label 0 to 4.
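A minimal sketch of the two fully connected layers and the Softmax output, in the same TensorFlow 1.x style, is shown below. It starts from a placeholder standing in for the output of max pooling layer 2, and the weight initialization is an assumption.

```python
# Fully connected layers 1-2 and Softmax, continuing from a [batch, 8, 8, 16]
# feature map; variable names and truncated-normal initialization are assumptions.
import tensorflow as tf

pool2 = tf.placeholder(tf.float32, [None, 8, 8, 16])  # output of max pooling layer 2
flat = tf.reshape(pool2, [-1, 8 * 8 * 16])            # [batch, 1024]

# Fully connected layer 1: 1024 -> 128, normally distributed weights, bias 0.1, ReLU.
w_fc1 = tf.Variable(tf.truncated_normal([1024, 128], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[128]))
fc1 = tf.nn.relu(tf.matmul(flat, w_fc1) + b_fc1)      # equation (5) followed by ReLU

# Fully connected layer 2: 128 -> 5 (one logit per speaker).
w_fc2 = tf.Variable(tf.truncated_normal([128, 5], stddev=0.1))
b_fc2 = tf.Variable(tf.constant(0.1, shape=[5]))
logits = tf.matmul(fc1, w_fc2) + b_fc2

# Softmax, equation (6): probability that the input belongs to labels 0..4.
probs = tf.nn.softmax(logits)
```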
To adjust the weights to their optimal values, we use back propagation. Back propagation is achieved by calculating the gradient of the loss at each node, which for a node y = w^T x is:

∂l/∂x = w · ∂l/∂y,  ∂l/∂w = x · (∂l/∂y)^T   (7)

where y is the output of the node, x is the input of the node, w is the weight of the node, and l is the loss.
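In practice, TensorFlow computes these gradients by automatic differentiation rather than by hand-written back propagation. The short sketch below illustrates this for a single weighted node; the placeholder shapes and the cross-entropy loss are illustrative assumptions.

```python
# Gradients of the loss with respect to a node's input and weight, obtained
# automatically in TensorFlow 1.x; shapes and the loss choice are assumptions.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1024])     # input of the node
w = tf.Variable(tf.truncated_normal([1024, 5]))  # weight of the node
y = tf.matmul(x, w)                              # output of the node
y_true = tf.placeholder(tf.float32, [None, 5])   # one-hot speaker labels
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_true, logits=y))
grad_x, grad_w = tf.gradients(loss, [x, w])      # dl/dx and dl/dw, as in equation (7)
```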
Optimization
During the implementation of our project, we found that overfitting occurred while running the program. Overfitting means that the accuracy on the training set is very high while the accuracy on the test set is low. We use the following methods to solve this problem.
(1) Regularization
Regularization has two forms: L1 loss and L2 loss. L2 loss has a unique solution, while L1 loss does not have this feature. We finally use L2 loss to optimize because it can better distinguish which result is superior. The formulas for L1 loss and L2 loss are:

L1: S = Σ_{i=0}^{n} |Y_i - h(x_i)|   (8)

L2: S = Σ_{i=0}^{n} (Y_i - h(x_i))^2   (9)
L2 minimizes the sum of the squares of the differences between the target values and the estimated values. We use TensorFlow's function to make the L2 loss work. The function utilizes the L2 norm to calculate the tensor loss value.
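A minimal sketch of how such an L2 penalty can be added with TensorFlow's tf.nn.l2_loss is given below; the stand-in weight matrices and data loss, and the scale factor corresponding to the λ column of Table 1, are illustrative assumptions.

```python
# L2 regularization added to the data loss in TensorFlow 1.x style.
import tensorflow as tf

# Stand-ins for the fully connected layers' weights and the cross-entropy loss.
w_fc1 = tf.Variable(tf.truncated_normal([1024, 128], stddev=0.1))
w_fc2 = tf.Variable(tf.truncated_normal([128, 5], stddev=0.1))
data_loss = tf.placeholder(tf.float32)

lam = 5e-4                                                     # λ, as in Table 1
l2_penalty = lam * (tf.nn.l2_loss(w_fc1) + tf.nn.l2_loss(w_fc2))
total_loss = data_loss + l2_penalty
```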
(2) Dropout
Dropout randomly drops some nodes with a fixed probability at the fully connected layer during training. Dropping some nodes randomly helps avoid overfitting. We use TensorFlow's dropout function to drop nodes. The function takes a tensor x as input and outputs its value after applying the drop ratio.
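A minimal sketch with tf.nn.dropout is given below; the dropout rate of 0.95 from Table 1 is interpreted here as the keep probability, which is an assumption, and the placeholder stands in for the output of fully connected layer 1.

```python
# Dropout applied to the output of fully connected layer 1 in TensorFlow 1.x style.
import tensorflow as tf

fc1 = tf.placeholder(tf.float32, [None, 128])   # stands in for the FC layer 1 output
keep_prob = tf.placeholder(tf.float32)          # e.g. 0.95 for training, 1.0 for testing
fc1_drop = tf.nn.dropout(fc1, keep_prob)        # randomly zeroes nodes of fc1
```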
(3) Update
We considered three methods to update the loss and find the minimum.
1) The steepest descent method. We need to find a minimum value, so this method searches iteratively in the direction opposite to the gradient at the current point with a fixed step size.
2) Momentum. Momentum gives a direction that is combined with the steepest descent direction by the parallelogram law, and the final update direction follows that combination.
3) Adam. Adam is one of the fastest of these algorithms and can find the minimum effectively.
Finally, we use Adam to update the loss. Adam is more effective and more widely applicable than the other methods. For our thousands of data samples, Adam can optimize the problem directly.
Adam first calculates the gradient at time step t:

g_t = ∇_θ J(θ_{t-1})   (10)

Then it calculates the exponential moving average of the gradient:

m_t = β_1 · m_{t-1} + (1 - β_1) · g_t   (11)

Next, it calculates the exponential moving average of the square of the gradient:

v_t = β_2 · v_{t-1} + (1 - β_2) · g_t^2   (12)

Since m_0 and v_0 are initialized to 0, the moving average of the gradient and the moving average of the square of the gradient need to be bias-corrected:

m̂_t = m_t / (1 - β_1^t)   (13)

v̂_t = v_t / (1 - β_2^t)   (14)
Finally, Adam updates the parameter.
(4) Learning rate optimization
We hope the learning rate is neither too high nor too low. TensorFlow's decay function can control the learning rate. We use this function to make the learning rate rise first and then decay, so that it remains moderate. The learning rate decays once by the decay rate at a stable decay step. Our team uses the staircase decay mode, because its scale is larger and it is easier to reduce overfitting.
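A minimal sketch combining the staircase learning-rate decay with the Adam update in TensorFlow 1.x style is shown below; the base learning rate and decay rate follow Table 1, while the decay step of 1000 and the stand-in loss are illustrative assumptions.

```python
# Staircase exponential decay of the learning rate feeding an Adam update.
import tensorflow as tf

w = tf.Variable(tf.truncated_normal([1024, 5]))   # stands in for the trainable weights
loss = tf.nn.l2_loss(w)                           # stands in for the total loss

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.001,    # base learning rate (Table 1)
    global_step=global_step,
    decay_steps=1000,       # assumed interval between decays
    decay_rate=0.3,         # decay rate (Table 1)
    staircase=True)         # staircase mode, as described above

train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)
```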
(5) Framework optimization
In the file 'dp_refines.py', the models are saved after training so that we can keep a record. We also use an API to simplify the code so that we can change or modify the parameters more easily.
Training and Testing
Firstly, before training, the files are converted into the .mat format with a ratio of training to testing data of 4:1. The amount of collected data is 8400x5x16 and the initial size of each data sample is defined as [32x32x1]. Then, to keep the program from running out of memory, all the data cannot be processed simultaneously, so we take the approach of batching the data: we define a variable 'batch size' as the size of each batch, and the computer only processes one batch of data at a time. We use 400 as the batch size of the testing set and 64 as the batch size of the training set. Therefore, the testing shape is [400x32x32x1] and the training shape is [64x32x32x1]. After being processed by the convolutional layers and max pooling layers described above, the size of the output data is changed to [8x8x16]. The loss generated during training is what we use to judge the average accuracy rate of training, and the value of the loss depends on the parameters such as the base learning rate (initial value 0.001). At the end of training, we adopt the Adam method to optimize the result and reduce the loss. Besides, the Softmax function gives the probability that each data sample belongs to each label. The process of testing is basically the same as training except for the data set. The accuracy of testing is also presented in the confusion matrix, which shows the accuracy rate of each label.
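The batched training and testing procedure described above can be sketched as follows. The placeholders and ops (x, y_true, keep_prob, total_loss, train_op, probs) are assumed to come from a single graph assembled from the previous sketches, and next_batch() is a hypothetical helper that returns one batch of features and one-hot labels; none of these names are prescribed by the invention.

```python
# Batched training (batch size 64) and testing (batch size 400) in TF 1.x style.
# next_batch(), train_x, train_y, test_x and test_y are hypothetical helpers/arrays.
import tensorflow as tf

accuracy = tf.reduce_mean(tf.cast(
    tf.equal(tf.argmax(probs, 1), tf.argmax(y_true, 1)), tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(20000):                         # iteration steps from Table 1
        bx, by = next_batch(train_x, train_y, 64)     # training batch [64, 32, 32, 1]
        sess.run(train_op, {x: bx, y_true: by, keep_prob: 0.95})

        if step % 1000 == 0:                          # periodic evaluation on the test set
            tx, ty = next_batch(test_x, test_y, 400)  # testing batch [400, 32, 32, 1]
            loss_val, acc = sess.run([total_loss, accuracy],
                                     {x: tx, y_true: ty, keep_prob: 1.0})
            print(step, loss_val, acc)
```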
Table 1 shows the results when we take different values of the parameters. We achieve the optimal recognition accuracy of 70.23% when the dropout rate is 0.95, the base learning rate is 0.001, the decay rate is 0.3, the number of iteration steps is 20000, and λ is 5.0E-04.
Table 1 Recognition Result
Dropout rate  Base learning rate  Decay rate  Iteration steps  λ  Average accuracy (%)  Standard deviation
0.95 0.001 0.3 23000 5.00E-04 67.97 2.23
0.98 0.001 0.3 23000 5.00E-04 66.84 2.3
0.92 0.001 0.3 23000 5.00E-04 66.47 2.26
0.95 0.0018 0.3 23000 5.00E-04 67.35 2.26
0.95 0.0002 0.3 23000 5.00E-04 64.37 2.31
0.95 0.001 0.36 23000 5.00E-04 66.36 2.29
0.95 0.001 0.26 23000 5.00E-04 66.54 2.43
0.95 0.001 0.2 23000 5.00E-04 69.17 2.38
0.95 0.001 0.18 23000 5.00E-04 67.49 2.28
0.95 0.001 0.3 26000 5.00E-04 66.86 2.41
0.95 0.001 0.3 22000 5.00E-04 65.1 2.27
0.95 0.001 0.3 20000 5.00E-04 70.23 2.13
0.95 0.001 0.3 19000 5.00E-04 64.87 2.41
0.95 0.001 0.3 23000 5.00E-04 67.97 2.23
0.95 0.001 0.3 23000 6.00E-04 67.39 1.8

Claims (3)

1. A speaker recognition system based on deep learning, wherein Mel Frequency Cepstral Coefficients are used to process the audio data into the shape 32x32x1; this approach makes the data processing in the subsequent convolutional network more efficient and faster.
2. The system according to claim 1, wherein the original data is randomly divided into a training set and a testing set proportionally, which ensures the rigor and reliability of the experimental results.
3. The system according to claim 1, wherein regularization and dropout are introduced to avoid overfitting; the regularization makes the neural network tend to learn smaller weights, and the dropout lessens the dependence between neurons, which makes the network more robust; by optimizing these parameters, the invention can effectively identify the speakers.

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101222A AU2019101222A4 (en) 2019-10-05 2019-10-05 A Speaker Recognition System Based on Deep Learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101222A AU2019101222A4 (en) 2019-10-05 2019-10-05 A Speaker Recognition System Based on Deep Learning

Publications (1)

Publication Number Publication Date
AU2019101222A4 true AU2019101222A4 (en) 2020-01-16

Family

ID=69146726

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101222A Ceased AU2019101222A4 (en) 2019-10-05 2019-10-05 A Speaker Recognition System Based on Deep Learning

Country Status (1)

Country Link
AU (1) AU2019101222A4 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053398A (en) * 2021-03-11 2021-06-29 东风汽车集团股份有限公司 Speaker recognition system and method based on MFCC (Mel frequency cepstrum coefficient) and BP (Back propagation) neural network
CN113053398B (en) * 2021-03-11 2022-09-27 东风汽车集团股份有限公司 Speaker recognition system and method based on MFCC (Mel frequency cepstrum coefficient) and BP (Back propagation) neural network
CN113642385A (en) * 2021-07-01 2021-11-12 山东师范大学 Deep learning-based facial nevus identification method and system
CN113642385B (en) * 2021-07-01 2024-03-15 山东师范大学 Facial nevus recognition method and system based on deep learning


Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry