AU2020102038A4 - A speaker identification method based on deep learning - Google Patents
A speaker identification method based on deep learning
- Publication number
- AU2020102038A4
- Authority
- AU
- Australia
- Prior art keywords
- data
- training
- speaker
- deep learning
- cnn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Complex Calculations (AREA)
Abstract
This invention lies in the field of digital audio processing and provides a speech recognition system that identifies speakers based on deep learning. The invention consists of the following steps. First, sufficient data were prepared and split into training and testing sets. Second, the data were preprocessed with Voice Activity Detection (VAD) to detect the effective audio segments and Mel-frequency cepstral coefficients (MFCC) for feature extraction. Then, the training data were fed in batches into the Convolutional Neural Network (CNN) that we had designed. Simultaneously, the parameters of the CNN, namely the dropout rate, base learning rate and loss rate, were adjusted to optimize the performance of the model. Eventually, the optimized CNN can be applied to the testing data, and identities can be recognized with an accuracy of 92.6%. In brief, this invention recognizes the identity of a speaker automatically, without human involvement.
Figure 1: Data preprocessing flow (Raw Data → Mono conversion → Voice Activity Detection → Audio Segment → transform to .mat format → Training Data).
Description
TITLE A speaker identification method based on deep learning
FIELD OF THE INVENTION This invention lies in the field of digital audio processing and provides recognition of speakers' identities powered by deep learning.
BACKGROUND OF THE INVENTION Human speech is produced by a complex biophysical process involving interactions between the human speech center and the vocal organs. Vocal organs such as the tongue, teeth, throat, lungs and nose vary in size and shape from person to person, so the sound maps of any two individuals will differ to some degree. Like other biometric authentication technologies, speech recognition has the advantages of being unforgettable, requiring nothing to be memorized and being easy to use. In the field of biometric authentication, speech recognition technology has received widespread attention for its convenience, economy and accuracy, and it is becoming an important and popular security authentication method in people's daily life and work. In the past two decades, speech recognition technology has made significant progress and is beginning to move from the lab to the marketplace. To be more specific, Siri, developed by Apple, and Alexa, developed by Amazon, are becoming prevalent in daily life, and both implement speaker recognition so that they can recognize the identity of the speaker.

For conventional automatic speech recognition methods, effective and rapid adaptation to specific accents and signal corruptions remains an open issue. To address this problem, deep learning is utilized for automatic speech recognition; overfitting to the training data, however, then becomes a problem to be solved. Furthermore, naively trained deep neural networks are usually treated as "black boxes" and are difficult to interpret directly because of their highly distributed representations.

From a general view, speaker recognition can be divided into three categories according to the application: speaker identification, speaker verification and speaker diarization. The three phases of speaker verification can be briefly described as the training phase, the enrollment phase and the evaluation phase. In addition, depending on the type of data used for enrollment and recognition, a speaker verification system operates in either text-dependent or text-independent mode. From a technical perspective, speech recognition analyzes the characteristics or features in the audio that differ among speakers and that are used for speaker authentication, surveillance and forensics. Specifically, checking the authenticity of a speaker amounts to comparing the speaker's speech with the template speech patterns of the many speakers already enrolled in the system.

The general technique for Voice Activity Detection (VAD) is dual-threshold endpoint detection based on short-time energy and zero-crossing rate. By applying this technique, the effective speech segments can be recognized and detected. Mel-Frequency Cepstral Coefficients (MFCC) are a widely used feature in automatic speech recognition, developed by Davis and Mermelstein in 1980. The process of extracting MFCC features can be summarized in seven phases: pre-emphasis; framing and windowing; fast Fourier transform; taking the absolute or squared value; Mel filtering; taking the logarithm; and, eventually, Delta MFCC. The Deep Neural Network (DNN) was first introduced in the 1960s, but the first practical application, LeNet for recognizing handwritten digits, did not appear until the late 1980s. The breakthrough in DNN-based speech recognition, however, was not made until 2011.
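As an illustration of the dual-threshold endpoint detection described in the background above, the following is a minimal Python sketch; the frame length, hop size and threshold values are illustrative assumptions and not values prescribed by this invention.

```python
import numpy as np

def dual_threshold_vad(signal, frame_len=400, hop=160,
                       high_ratio=0.3, low_ratio=0.05, zcr_thresh=0.1):
    """Return a boolean speech/non-speech decision per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

    energy = np.sum(frames ** 2, axis=1)                                  # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)   # zero-crossing rate

    high = high_ratio * energy.max()   # upper energy threshold: definitely speech
    low = low_ratio * energy.max()     # lower energy threshold: possible speech

    # Dual-threshold rule: high-energy frames are speech; low-energy frames
    # are kept only if the zero-crossing rate suggests unvoiced speech.
    return (energy > high) | ((energy > low) & (zcr > zcr_thresh))
```

Contiguous runs of frames marked as speech then give the start and end points of each effective speech segment.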
The great success of deep learning around 2010 is mainly due to three factors: the vast amount of data available to train networks, the availability of sufficient computing resources, and the evolution of algorithms, which together have greatly improved the accuracy and broadened the scope of applications of DNNs. Technically, DNNs introduce multiple layers of non-linear processing units, which allow complex data to be modelled well, while representations are yielded automatically by the multi-layer configuration from the raw data representation. Therefore, signal spectrograms can be used directly as raw features in a general-purpose learning algorithm. In this invention, we concentrate on recognizing the identity of the speaker by using a text-independent speech recognition system based on deep learning. To solve the issues of traditional speech recognition systems described above, we optimize the architecture through regularization, dropout and the number of iterations. Additionally, TensorFlow is used as the deep learning framework for implementing the model, since it is an end-to-end open source machine learning platform with a comprehensive and flexible ecosystem of tools, libraries and community resources.
SUMMARY OF THE INVENTION In order to implement automatic audio feature extraction and overcome the shortcomings and deficiencies of the existing technology, this invention proposes a speech recognition method for quickly identifying the identity of a speaker based on deep learning. From an overall perspective, the procedure of developing our system can be divided into five parts: data processing, architecture design, architecture training, optimization and testing. Our deep learning speech recognition invention also consists of several components: a speech segment database, convolutional neural networks, parameter optimization and the implementation of speech recognition.
Data processing The speech segment database was established by collecting our own voice recordings. We then preprocessed all the audio segments by converting them to mono and splitting them into segments of identical length, about seven or eleven seconds long. After this separation, the audio segments were divided into a training data set and a testing data set. Meanwhile, Voice Activity Detection (VAD) was applied to all the segments to detect the valid fragments, that is, to determine the start point and the end point of each segment. In this way, the fragments containing voice are detected and split out for feature extraction, which is performed with Mel-frequency cepstral coefficients (MFCC). Eventually, the convolutional neural network is implemented according to our design, with the extracted features as its input.
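A minimal sketch of the per-segment feature extraction step is shown below. The use of librosa and scipy is an assumption made for illustration; the invention itself only specifies mono conversion, MFCC features and the .mat output format, and the sampling rate and coefficient count are likewise assumed values.

```python
import librosa
import numpy as np
from scipy.io import savemat

def segment_to_mat(wav_path, mat_path, sr=16000, n_mfcc=26):
    """Convert one audio segment to mono, extract MFCCs and save them as .mat."""
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)          # mono conversion
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, n_frames)
    savemat(mat_path, {"mfcc": mfcc.astype(np.float32)})         # stored for later reshaping
    return mfcc.shape
```

The choice of 26 coefficients is consistent with the 32x32x26 input shape used later in the network.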
Architecture design Deep learning was applied for this speech recognition system; therefore, the architecture is designed with three convolutional layers followed by two fully connected layers, as shown in Figure 4. To take full advantage of the automatic feature extraction of deep learning, a layer-by-layer initializing training mode was adopted. As a result, the training time can be reduced and technical difficulties such as overfitting can be overcome during the training phase. The difficulties of audio feature extraction and automatic audio recognition are solved by applying this method.
Convolutional neural network (CNN) A basic convolutional neural network (CNN) consists of three structures: convolution, activation and pooling. The CNN of this invention is a sequence of layers, specifically three convolutional layers. A convolutional layer consists of several convolutional units; the first convolutional layer may only be able to extract some low-level features, so that the following layers can iteratively extract more complex features from those low-level features. The most common operation in a CNN is convolution, which uses a convolutional kernel to extract features from the input image. A convolutional kernel is a matrix that slides over the input as a sliding window and is multiplied with the input to produce the desired output. The output of the system at a certain position is the result of the interaction of multiple inputs, and the size of the output can be calculated with formula (1), where W is the size of the input, F is the receptive field (kernel size), S is the stride and P is the zero-padding:

Output size = (W - F + 2P) / S + 1    (1)
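The following minimal helper implements formula (1); the example values reproduce the 32 to 30 and 15 to 13 spatial sizes that appear in the model summary later (3x3 kernel, stride 1, no padding).

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(32, 3))  # 30, first convolutional layer
print(conv_output_size(15, 3))  # 13, second convolutional layer
```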
Fully connected neural network Besides the convolutional layers, the fully connected layers play the role of a "classifier" in the entire neural network. Unlike in a convolutional layer, each node in a fully connected layer is connected to all the nodes of the previous layer; it synthesizes the previously extracted features and maps the input image to the label set, i.e. it classifies.
Pooling Pooling is a subsampling operation whose primary goal is to reduce the feature space, or equivalently the resolution, of the feature map. Because a feature map has many parameters, fine details are not beneficial for extracting high-level features. The pooling layers in our invention are sandwiched between successive convolutional layers and are used to compress the amount of data and parameters and to reduce overfitting.
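A small numpy sketch of 2x2 max pooling with stride 2, the subsampling used between our convolutional layers; it halves the spatial resolution of a single-channel feature map (e.g. 30x30 to 15x15).

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling of a 2D feature map."""
    h, w = fmap.shape
    h2, w2 = h // 2, w // 2
    # Group the map into 2x2 blocks and keep the maximum of each block.
    return fmap[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))

print(max_pool_2x2(np.random.rand(30, 30)).shape)  # (15, 15)
```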
Activation function After convolution, a bias is usually added and a non-linear activation function is introduced, where the bias is denoted b and the activation function h(·). The result after activation is given by formula (2):

z_j = h(Σ_i w_ij · x_i + b_j)    (2)
Architecture training A convolutional neural network is essentially an input-output mapping. It can learn a large number of mapping relationships between inputs and outputs without requiring any exact mathematical expression between them. As long as the convolutional neural network is trained on a specific pattern, the network acquires the ability to map between other inputs and outputs. The training of our convolutional neural network consists of two phases: a forward propagation phase and a backward propagation phase. In the first phase, a sample is taken from the sample set and input into the network, and the corresponding actual output is calculated; data are transferred from the input layer to the output layer through a step-by-step transformation. In the backward propagation phase, the error between the actual output and the corresponding ideal output is calculated, and the gradient of each weight is then derived. Eventually, the weights are updated by the gradient descent algorithm.
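The forward and backward propagation phases described above can be sketched as a single training step in TensorFlow, the framework used in this invention; `model` is assumed to be the CNN defined later, and the particular loss and optimizer shown here are illustrative choices.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

@tf.function
def train_step(model, x_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)    # forward propagation
        loss = loss_fn(y_batch, y_pred)           # error against the ideal output
    grads = tape.gradient(loss, model.trainable_variables)              # backward propagation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))    # gradient descent update
    return loss
```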
Optimization To optimize the convolutional neural network, the training data set was fed in batches into the network in order to reduce the loss function. Stochastic gradient descent (SGD) and Adam were chosen as optimizers, which reduces the load on the machine and allows faster convergence. To reduce overfitting, L2 regularization was implemented. Besides regularization, dropout was also used to reduce overfitting, meaning that some nodes in the fully connected layers are discarded with a given probability.
Regularization Regularization is an effective way to avoid overfitting and to ensure generalization ability by explicitly controlling model complexity in machine learning. Regularization introduces a model-complexity term into the loss function and weakens the influence of noise in the training data by penalizing the weight values W. The regularization used in this invention is L2 regularization, which adds the sum of squares of the weight parameters to the original loss function. The calculation is shown in formula (3), where E is the training error without the regularization term, λ is the regularization coefficient, and the regularization is applied only to the fully connected layers.
L = E + λ Σ_j w_j²    (3)
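In Keras, formula (3) corresponds to attaching an l2 kernel regularizer to a layer, which adds λ·Σw² for that layer's weights to the training loss; the coefficient 0.01 below is an illustrative assumption, not a value fixed by the invention.

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(
    256,
    activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01),  # adds 0.01 * sum(w^2) to the loss
)
```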
Dropout Dropout is a network regularization method currently used by almost all deep convolutional neural networks with fully connected layers. The idea behind dropout is that, for each neuron in a given layer, the neuron's output is randomly set to 0 with probability p during the training phase. All neurons are active in the test phase, but their weights are multiplied by (1 - p) to ensure that all weights have the same expectation in the training and testing stages. In this invention, dropout is applied to both the input layer and the hidden layers of the system.
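A minimal numpy sketch of this dropout scheme: units are zeroed with probability p during training, and activations are scaled by (1 - p) at test time so that the expected value matches between the two phases.

```python
import numpy as np

rng = np.random.default_rng()

def dropout_train(x, p):
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask

def dropout_test(x, p):
    return x * (1.0 - p)              # scale instead of dropping at test time
```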
Adam The Adam optimization algorithm is an extension of stochastic gradient descent (SGD). What makes it different from SGD is that it combines the advantages of two other extensions of SGD, the adaptive gradient algorithm (AdaGrad) and root mean square propagation (RMSProp). AdaGrad maintains a per-parameter learning rate, which improves performance on problems with sparse gradients. RMSProp also maintains a per-parameter learning rate, adjusted according to the average of the most recent weight gradients, i.e. how quickly they change, which means the algorithm performs well on online and non-stationary problems. Therefore, Adam is also used for optimization in our invention.
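For illustration, a single Adam update can be sketched as below; the first moment plays the role of momentum and the second moment gives the per-parameter scaling inherited from AdaGrad/RMSProp. The hyperparameter values are the common defaults, not values fixed by this invention.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)              # bias correction for step t (t >= 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```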
Testing After the training and optimization phases, the invention is able to recognize the identity of a speaker and is ready for testing. As a result, this invention can quickly recognize the speaker from a voice clip at least seven seconds long.
DESCRIPTION OF DRAWINGS Figure 1 is the flow block diagram of data preprocessing. Figure 2 is the flow block diagram of model training. Figure 3 is the flow block diagram of the trained model making predictions. Figure 4 is the architecture of the convolutional neural network in our invention. Figure 5 indicates the general structure of the deep neural network.
Design The architecture of the convolutional neural network is shown in Figure 4. It has three convolutional layers, each followed by a maximum pooling layer, and uses a flatten layer and two fully connected layers to complete the network. Our network architecture is inspired by the strength of convolutional neural networks in processing multidimensional data. We collected about 180 minutes of audio data from two different individuals and cut it into 11-second segments for storage. Figure 1 displays how we preprocess the data, a procedure that applies to all training and validation data. Our network has three convolutional layers followed by two fully connected layers. Each convolutional layer consists of three components: convolution, batch normalization and pooling. The convolution itself is a 3x3 kernel (whose entries are learned during training) that filters across the entire audio data. The stride is set to one and zero padding is added to preserve the peripheral information. Right after the convolution, a pooling operation down-samples the previous output, followed by a ReLU activation. These three components are grouped together in this order and repeated three times before the flatten layer transforms the 2D array into a 1D vector. This vector is then fed into two consecutive fully connected layers to produce the final prediction.
Procedure The procedure of this invention can be divided into the following steps.
Step 1: Data collection In this invention, all the training data (180 minutes) are voice recordings only, and Gaussian noise is added to some of the data. All collected voice recordings were split to obtain audio segments of identical length. In total, each speaker has approximately 445 audio segments for training and over 250 voice clips for testing, and each segment is seven or eleven seconds long.
Step 2: Data preprocessing
1) Converting all the audio segments to mono.
2) Cutting all audio into eleven-second segments.
3) Extracting the MFCC features of each audio segment into a numpy array.
4) Transforming the numpy array to the '.mat' format for further transformation.
5) Transforming the shape of the data: the data of each audio segment are reshaped to a 32x32x26 matrix for the later calculations in the convolutional neural network.
6) One-hot encoding: the name of the speaker in each label is encoded as a distinct array so that it can be used for classifying the speaker.
Step 3: Training and optimization We use a CNN model which contains three convolutional layers and two fully connected layers:
Layer (type) | Output Shape | Param # |
---|---|---|
conv2d (Conv2D) | (None, 30, 30, 32) | 7520 |
max_pooling2d (MaxPooling2D) | (None, 15, 15, 32) | 0 |
conv2d_1 (Conv2D) | (None, 13, 13, 32) | 9248 |
max_pooling2d_1 (MaxPooling2D) | (None, 6, 6, 32) | 0 |
conv2d_2 (Conv2D) | (None, 4, 4, 128) | 36992 |
max_pooling2d_2 (MaxPooling2D) | (None, 2, 2, 128) | 0 |
flatten (Flatten) | (None, 512) | 0 |
dense (Dense) | (None, 256) | 131328 |
dense_1 (Dense) | (None, 64) | 16448 |
dense_2 (Dense) | (None, 2) | 130 |
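The following Keras sketch is one way to realize this model; it reproduces the layer shapes and parameter counts of the summary above, under the assumption of a 32x32x26 input, 3x3 kernels with 'valid' padding, 2x2 max pooling, ReLU activations and a softmax output over the two speakers.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 26)),
    layers.Conv2D(32, 3, activation="relu"),   # -> (30, 30, 32),  7,520 params
    layers.MaxPooling2D(2),                    # -> (15, 15, 32)
    layers.Conv2D(32, 3, activation="relu"),   # -> (13, 13, 32),  9,248 params
    layers.MaxPooling2D(2),                    # -> (6, 6, 32)
    layers.Conv2D(128, 3, activation="relu"),  # -> (4, 4, 128),  36,992 params
    layers.MaxPooling2D(2),                    # -> (2, 2, 128)
    layers.Flatten(),                          # -> (512,)
    layers.Dense(256, activation="relu"),      # 131,328 params
    layers.Dense(64, activation="relu"),       # 16,448 params
    layers.Dense(2, activation="softmax"),     # 130 params, one class per speaker
])
model.summary()
```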
Step 4: Testing
We take more than 500 unlabeled audio clips as input, 7-second segments for multichannel audio and 11-second segments for mono, and evaluate all nodes in the CNN to output the final prediction. In the first attempt, our recognition results were as follows:
Test Data Type | Result (Recognition Rate) |
---|---|
Multichannel, No-noise | 100% |
Mono, No-noise | 100% |
Multichannel, Gauss-noise | 100% |
Mono, Gauss-noise | 70% |
Overall | 92.6% |
Then we adjusted some parameters, and the final key parameters were:
- Optimizer: AdamOptimizer
- Learning rate: 0.001
- Kernel initializer: truncated_normal_initializer(stddev=0.01)
- Epochs: 6
- Batch size: 10
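A sketch of this final training configuration is given below, assuming the Keras model sketched earlier and preprocessed arrays `x_train`/`y_train` from the preceding steps; the initializer would be passed as `kernel_initializer` when the Conv2D and Dense layers are created.

```python
import tensorflow as tf

init = tf.keras.initializers.TruncatedNormal(stddev=0.01)   # kernel initializer for Conv2D/Dense

def train_final(model, x_train, y_train):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model.fit(x_train, y_train, epochs=6, batch_size=10)
```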
And the adjusted training results are as follows:
Test Data Type | Result (Recognition Rate) |
---|---|
Multichannel, No-noise | 100% |
Mono, No-noise | 100% |
Multichannel, Gauss-noise | 100% |
Mono, Gauss-noise | 100% |
Overall | 100% |
Claims (2)
1. A speaker identification method based on deep learning, characterized in that it is a method for speech recognition based on deep learning which includes Voice Activity Detection for detecting the effective audio segments, Mel-frequency cepstral coefficients for feature extraction, and a Convolutional Neural Network for identifying the identity of the speaker.
2. The method according to claim 1, wherein the method has a simple architecture that can be applied to speaker identification quickly; as the speech time of our samples is not very long, our invention can also be applied to simple and rapid speaker identification.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020102038A AU2020102038A4 (en) | 2020-08-28 | 2020-08-28 | A speaker identification method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020102038A AU2020102038A4 (en) | 2020-08-28 | 2020-08-28 | A speaker identification method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020102038A4 true AU2020102038A4 (en) | 2020-10-08 |
Family
ID=72663853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020102038A Ceased AU2020102038A4 (en) | 2020-08-28 | 2020-08-28 | A speaker identification method based on deep learning |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2020102038A4 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113345461A (en) * | 2021-04-26 | 2021-09-03 | 北京搜狗科技发展有限公司 | Voice processing method and device for voice processing |
CN113380255A (en) * | 2021-05-19 | 2021-09-10 | 浙江工业大学 | Voiceprint recognition poisoning sample generation method based on transfer training |
CN113611314A (en) * | 2021-08-03 | 2021-11-05 | 成都理工大学 | Speaker identification method and system |
CN113627300A (en) * | 2021-08-02 | 2021-11-09 | 中电福富信息科技有限公司 | Face recognition and living body detection method based on deep learning |
CN114495947A (en) * | 2022-03-04 | 2022-05-13 | 蔚来汽车科技(安徽)有限公司 | Method and apparatus for detecting voice activity |
CN115171700A (en) * | 2022-06-13 | 2022-10-11 | 武汉大学 | Voiceprint recognition voice assistant method based on pulse neural network |
WO2023036016A1 (en) * | 2021-09-07 | 2023-03-16 | 广西电网有限责任公司贺州供电局 | Voiceprint recognition method and system applied to electric power operation |
-
2020
- 2020-08-28 AU AU2020102038A patent/AU2020102038A4/en not_active Ceased
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113345461A (en) * | 2021-04-26 | 2021-09-03 | 北京搜狗科技发展有限公司 | Voice processing method and device for voice processing |
CN113380255A (en) * | 2021-05-19 | 2021-09-10 | 浙江工业大学 | Voiceprint recognition poisoning sample generation method based on transfer training |
CN113380255B (en) * | 2021-05-19 | 2022-12-20 | 浙江工业大学 | Voiceprint recognition poisoning sample generation method based on transfer training |
CN113627300A (en) * | 2021-08-02 | 2021-11-09 | 中电福富信息科技有限公司 | Face recognition and living body detection method based on deep learning |
CN113611314A (en) * | 2021-08-03 | 2021-11-05 | 成都理工大学 | Speaker identification method and system |
WO2023036016A1 (en) * | 2021-09-07 | 2023-03-16 | 广西电网有限责任公司贺州供电局 | Voiceprint recognition method and system applied to electric power operation |
CN114495947A (en) * | 2022-03-04 | 2022-05-13 | 蔚来汽车科技(安徽)有限公司 | Method and apparatus for detecting voice activity |
CN115171700A (en) * | 2022-06-13 | 2022-10-11 | 武汉大学 | Voiceprint recognition voice assistant method based on pulse neural network |
CN115171700B (en) * | 2022-06-13 | 2024-04-26 | 武汉大学 | Voiceprint recognition voice assistant method based on impulse neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020102038A4 (en) | A speaker identification method based on deep learning | |
Jahangir et al. | Text-independent speaker identification through feature fusion and deep neural network | |
Liu et al. | GMM and CNN hybrid method for short utterance speaker recognition | |
Lian et al. | Speech emotion recognition via contrastive loss under siamese networks | |
Ding et al. | Autospeech: Neural architecture search for speaker recognition | |
CN109949824B (en) | City sound event classification method based on N-DenseNet and high-dimensional mfcc characteristics | |
CN106952649A (en) | Method for distinguishing speek person based on convolutional neural networks and spectrogram | |
Ohi et al. | Deep speaker recognition: Process, progress, and challenges | |
CN112581979A (en) | Speech emotion recognition method based on spectrogram | |
CN105206270A (en) | Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM) | |
CN113312989B (en) | Finger vein feature extraction network based on aggregated descriptors and attention | |
CN115862684A (en) | Audio-based depression state auxiliary detection method for dual-mode fusion type neural network | |
CN108364662A (en) | Based on the pairs of speech-emotion recognition method and system for differentiating task | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
CN111986699A (en) | Sound event detection method based on full convolution network | |
CN116758451A (en) | Audio-visual emotion recognition method and system based on multi-scale and global cross attention | |
CN117746908A (en) | Voice emotion recognition method based on time-frequency characteristic separation type transducer cross fusion architecture | |
CN117854473B (en) | Zero sample speech synthesis method based on local association information | |
Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
Alashban et al. | Speaker gender classification in mono-language and cross-language using BLSTM network | |
Chinmayi et al. | Emotion Classification Using Deep Learning | |
JPH09507921A (en) | Speech recognition system using neural network and method of using the same | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Flower et al. | A novel concatenated 1D-CNN model for speech emotion recognition | |
CN117976006A (en) | Audio processing method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |