CN116863939A - Voiceprint recognition method based on global attention mechanism and adopting DenseNet-LSTM-ED - Google Patents
- Publication number: CN116863939A
- Application number: CN202310826924.6A
- Authority
- CN
- China
- Prior art keywords
- information
- processing
- matrix
- spectrogram
- loss function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
All under G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding):
- G10L17/02: Speaker identification or verification techniques; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/18: Speaker identification or verification techniques; artificial neural networks; connectionist approaches
- G10L17/22: Speaker identification or verification techniques; interactive procedures; man-machine interfaces
- G10L25/18: Speech or voice analysis techniques; the extracted parameters being spectral information of each sub-band
- G10L25/21: Speech or voice analysis techniques; the extracted parameters being power information
- G10L25/30: Speech or voice analysis techniques; analysis technique using neural networks
- G10L25/45: Speech or voice analysis techniques; characterised by the type of analysis window
Abstract
The application discloses a voiceprint recognition method based on a global attention mechanism and a DenseNet-LSTM-ED network, belonging to the technical field of voiceprint recognition. First, the voice signal is divided into frames, windowed, Fourier-transformed, converted to an energy density spectrum, log-transformed and color-mapped to obtain the corresponding spectrogram. The spectrogram is then processed separately by a DenseNet module, an LSTM unit and an ED module: the outputs of the DenseNet module and the LSTM unit are fused into space-time fusion information, while the ED module produces enhancement information. The space-time fusion information and the enhancement information are then fused into space-time enhancement information, and a global attention mechanism assigns it different weights so as to increase the contribution of key-frame speech to the recognition result. Finally, speaker classification is realized under the joint supervision of a Softmax loss function and a Center Loss function.
Description
Technical Field
The application relates to the technical field of voiceprint recognition, in particular to a voiceprint recognition method based on a global attention mechanism and a DenseNet-LSTM-ED network.
Background
Speaker recognition, also known as voiceprint recognition, is a technique that identifies a speaker by voice. A voiceprint is a set of characteristic parameters extracted from speech that can characterize the speaker's identity. Voiceprints are similar to biometric features such as irises or DNA: because physiological structures such as the nasal and oral cavities differ from person to person, different speakers have different pronunciation modes and habits, and even deliberate imitation cannot reproduce the essential features that carry a speaker's identity information. Voiceprint recognition can therefore distinguish people of different identities through these features and determine the identity of a speaker. With the rapid development of deep learning, voiceprint recognition under deep learning has stronger feature-representation capability than traditional methods and can extract higher-dimensional abstract features from speech. However, most current deep-learning speaker recognition algorithms consider only the spatial or only the temporal characteristics of speech, making model training difficult and accuracy low.
Disclosure of Invention
Based on this, there is a need to address these technical problems. The present application provides a voiceprint recognition method based on a global attention mechanism and DenseNet-LSTM-ED, which includes the following steps:
S100: obtain the spectrogram corresponding to the voice signal by dividing the signal into frames, windowing, Fourier transforming, computing the energy density spectrum, log transforming and color mapping;
S200: take the spectrogram obtained in step S100 as input to a DenseNet module for spatial feature extraction, obtaining the spatial information of the voice signal;
S300: copy the spectrogram information obtained in step S100 and feed it to the LSTM units; after passing through t LSTM units, the timing information of the voice signal is fully extracted;
S400: copy the spectrogram information obtained in step S100 and feed it to an ED module; the ED module performs deconvolution processing on the spectrogram information, trend information processing on the spectrogram information, fusion of the deconvolved information with the trend-processed information, and convolution of the fused result to generate enhancement information;
S500: splice the outputs of the DenseNet module and the LSTM unit into space-time fusion information; fuse the space-time fusion information with the enhancement information ED to form space-time enhancement information; assign different weights to the space-time enhancement information with an attention mechanism; combine a Softmax loss function and a Center Loss function into a total loss function; and identify the voiceprint category using the total loss function.
Further, step S200 includes: the DenseNet module comprises one initial convolution, N dense connection modules (Dense Blocks) and multiple transition layers (Transitions). A Dense Block contains the feature maps x_0, x_1, ..., x_{l-1}, x_l of layer 0, layer 1, ..., layer l-1 and layer l. The feature map of layer l is obtained by splicing the feature maps of all preceding layers, passing the spliced result through the nonlinear transformation H_l to obtain the spliced feature information H_l([x_0, x_1, ..., x_{l-1}]), and mapping that through the activation function γ(x), so that the feature map x_l of layer l is calculated as:

x_l = γ(H_l([x_0, x_1, ..., x_{l-1}]))

where γ(x) denotes the activation function and λ_1, λ_2 are non-integer multiplier factors.
Further, step S400 includes:
Step S401: deconvolution processing is carried out on the spectrogram information:

O_t = s1 · (a1 − 1) + k1 − 2 · p1

where a1 is the matrix of spectrogram pixels; s1 is the stride (the length of each shift) of the convolution kernel; k1 is the size of the convolution kernel; p1 is the first padding matrix when the kernel size does not match the size of the spectrogram matrix a1, and p1 is 0 when it does; and O_t is the information matrix after deconvolution.
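The deconvolution output-size relation above can be sanity-checked with a small helper; the parameter names mirror the patent's a1, s1, k1 and p1, with the padding treated as a scalar amount for simplicity:

```python
def deconv_output_size(a, s, k, p):
    """Output side length of a transposed convolution (deconvolution).

    a: input side length (spectrogram matrix size a1)
    s: stride of the kernel (s1)
    k: kernel size (k1)
    p: padding amount (p1; 0 when the kernel and input sizes already match)
    """
    return s * (a - 1) + k - 2 * p

# e.g. a 64x64 spectrogram, stride 2, 4x4 kernel, padding 1 upsamples to 128x128
size = deconv_output_size(64, 2, 4, 1)
```

This is the usual inverse of the convolution output-size formula, which is why deconvolution enlarges the spectrogram before the ED module's final convolution shrinks it again.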
Step S402: trend information processing is carried out on the spectrogram information.
In the matrix of spectrogram pixels, the trend information at each pixel is obtained by a numerical calculation over the k periods around its position coordinates (i, g):

x̄(i, g) = (1 / (2k + 1)) · Σ_{j = −k}^{k} x(i, g + j)

where x̄(i, g) is the trend information at pixel position (i, g), x(i, g) is the original information at (i, g), x(i, g + j) is the original information at (i, g + j), j ∈ (−k, k), and k is a positive integer.
Evaluating this calculation at every pixel position of the spectrogram yields the trend-processed information matrix D_t, the n × m matrix whose entry at position (i, g) is the trend value of that pixel, where n and m are positive integers giving the dimensions of the spectrogram matrix.
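Reading the trend computation as a windowed mean over the 2k+1 neighboring values along each row (an interpretation consistent with the j ∈ (−k, k) window described above; the edge handling by repetition is an added assumption), a numpy sketch:

```python
import numpy as np

def trend_matrix(x, k):
    """Per-pixel trend: mean of x(i, g+j) for j in [-k, k] along each row.

    Edges are padded by repeating the border value so that every position
    has 2k+1 neighbours; this edge policy is an assumption, not from the patent.
    """
    padded = np.pad(x, ((0, 0), (k, k)), mode="edge")
    kernel = np.ones(2 * k + 1) / (2 * k + 1)
    # valid convolution of each padded row recovers an n x m trend matrix
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), 1, padded)

x = np.arange(12.0).reshape(3, 4)   # toy 3 x 4 "spectrogram"
Dt = trend_matrix(x, 1)
```

D_t has the same n × m shape as the input, so it can later be fused element-wise with the deconvolved matrix O_t (after O_t is brought to a matching size).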
Step S403: the deconvolved information and the trend-processed information are fused.
The information matrix obtained after deconvolution is O_t and the information matrix obtained by trend processing is D_t; the two are fused to form the information fusion matrix OD:

OD = μ · O_t + r · D_t

where μ denotes the balance parameter of the deconvolved information matrix O_t, and r is a trend scaling factor used to control the size of the trend-processed information matrix D_t.
Step S404: the fused information matrix OD is convolved to generate the enhancement information. The information fusion matrix OD serves as the input of the convolution, which performs feature extraction on it according to:

ED = (OD − k2 + 2 · p2) / s2 + 1

where ED is the enhancement information, k2 is the size of the convolution kernel, s2 is the stride of the convolution kernel, p2 is the second padding matrix when the kernel size does not match the size of the information fusion matrix OD, and p2 is 0 when it does.
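If the fusion in step S403 is read as a weighted sum, with `mu` as a stand-in name for the balance parameter and `r` as the trend scaling factor, a minimal sketch is:

```python
import numpy as np

def fuse(Ot, Dt, mu=0.5, r=0.5):
    """Weighted fusion of the deconvolved matrix Ot and the trend matrix Dt.

    mu is a stand-in name for the patent's balance parameter; r is the trend
    scaling factor controlling the contribution of Dt. Both defaults are
    illustrative, not values stated in the patent.
    """
    return mu * Ot + r * Dt

Ot = np.ones((2, 2))          # toy deconvolved information matrix
Dt = np.full((2, 2), 3.0)     # toy trend-processed information matrix
OD = fuse(Ot, Dt, mu=0.5, r=0.5)
```

The fused matrix OD then feeds the final convolution of step S404, which the patent credits with avoiding the noise of deconvolution-only processing and the information loss of trend-only processing.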
Further, step S500 includes:
S501: splice the outputs of the DenseNet module and the LSTM unit into space-time fusion information, and fuse the space-time fusion information with the enhancement information to form space-time enhancement information;
S502: weight the space-time enhancement information with the attention mechanism, identify the voiceprint category using the key-frame speech, and predict the category of the voiceprint;
S503: combine the Softmax Loss function with the Center Loss function into a total loss function, and use it to calculate the difference between the true voiceprint category and the predicted category, obtaining a loss value;
S504: judge whether the loss value equals a preset value; if so, the recognition of the voiceprint category is complete; if not, proceed to step S505;
S505: assign new weights to the space-time enhancement information with the attention mechanism, identify the voiceprint category again using the key-frame speech, and predict the category again;
S506: combine the Softmax Loss function with the Center Loss function again into a total loss function, and calculate the difference between the true category and the newly predicted category, obtaining a new loss value;
S507: judge whether the new loss value equals the preset value; if so, the recognition of the voiceprint category is complete; if not, repeat steps S505 to S507 until the loss value equals the preset value and the classification of the voiceprint is complete.
Further, the space-time fusion information and the enhancement information are added to obtain the space-time enhancement information M, which consists of several sub space-time enhancement vectors m_1, m_2, ..., m_t of the same dimension, where t is a positive integer. A global attention mechanism is provided to assign different weights to the space-time enhancement information M; the specific process is as follows.

Compute the similarity α_ti between the sub space-time enhancement information m_t and the sub space-time enhancement information m_i:

α_ti = exp(score(m_t, m_i)) / Σ_{i'} exp(score(m_t, m_{i'}))

where score(m_t, m_i) is the corresponding scoring function. The vector C_t is then obtained by weighted averaging:

C_t = Σ_i α_ti · m_i

Further, the sub space-time enhancement information m_t and the vector C_t are spliced head to tail and multiplied by the weight matrix W_c of the attention mechanism, and the space-time enhancement information based on the global attention mechanism is calculated as:

h̃_t = tanh(W_c [m_t ; C_t])

where tanh is the hyperbolic tangent function. Finally, the attention weight values are normalized with a softmax classification layer:

p = softmax(W_t · h̃_t)

where p is the probability (normalization limits values to [0, 1], and numbers in that range can represent probabilities) and W_t is the feature weight corresponding to h̃_t.
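The weighting steps above follow the shape of Luong-style global attention. A minimal numpy sketch under that reading, with a dot-product scoring function and a given weight matrix Wc as assumptions:

```python
import numpy as np

def global_attention(M, t, Wc):
    """Global attention over rows of M, the (T, d) matrix whose rows are the
    sub space-time enhancement vectors m_1..m_T.

    Returns the attention-weighted representation for position t. The
    dot-product score and the shape of Wc are illustrative assumptions.
    """
    scores = M @ M[t]                     # score(m_t, m_i) as a dot product
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # alpha_ti: softmax-normalised similarity
    C_t = alpha @ M                       # weighted-average context vector C_t
    concat = np.concatenate([M[t], C_t])  # head-to-tail splice [m_t ; C_t]
    return np.tanh(Wc @ concat)           # tanh(Wc [m_t ; C_t])

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 8))           # 5 sub-vectors of dimension 8
Wc = rng.standard_normal((8, 16))         # maps the 16-dim splice back to 8 dims
h = global_attention(M, 2, Wc)
```

The resulting h̃_t would then be passed through the softmax classification layer to produce the probability p over voiceprint categories.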
Further, step S500 includes the following total loss function:

L = L_s + λ · L_c

where L denotes the total loss function, L_s denotes the Softmax loss function, L_c denotes the Center Loss function, and λ is a factor used to balance the two loss functions.
Further, the Softmax loss function is as follows:

L_s = − Σ_{i=1}^{m} log( exp(W_{y_i}^T x_i + b_{y_i}) / Σ_j exp(W_j^T x_i + b_j) )

where x_i denotes the i-th feature and y_i is its true category label; W_{y_i} and W_j denote the weight vectors for classifying x_i into class y_i and class j, respectively; b_{y_i} and b_j denote the bias terms of class y_i and class j, respectively; and m denotes the mini-batch size.
Further, the Center Loss function is as follows:

L_c = (1/2) · Σ_{i=1}^{m} ‖x_i − c_{y_i}‖²

where c_{y_i} denotes the class center of the y_i-th class of features, x_i denotes the i-th feature, and m denotes the mini-batch size.
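The joint supervision described above can be sketched in numpy; averaging over the mini-batch and the default λ are illustrative choices, not constants stated in the patent:

```python
import numpy as np

def softmax_loss(X, y, W, b):
    """Cross-entropy over softmax logits, averaged over the mini-batch."""
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).mean()

def center_loss(X, y, centers):
    """Mean of (1/2) ||x_i - c_{y_i}||^2 over the mini-batch."""
    diff = X - centers[y]
    return 0.5 * (diff ** 2).sum(axis=1).mean()

def total_loss(X, y, W, b, centers, lam=0.1):
    """L = L_s + lambda * L_c, lambda balancing the two losses."""
    return softmax_loss(X, y, W, b) + lam * center_loss(X, y, centers)

# toy 2-class example: features sitting exactly on their class centers
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W, b = np.eye(2), np.zeros(2)
centers = X.copy()
ls = total_loss(X, y, W, b, centers)
```

Pulling each feature toward its class center shrinks intra-class variation, which is the "intra-class compactness" the beneficial-effects section credits to the joint supervision.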
The beneficial technical effects of the application are as follows:
(1) The voice signal is divided into frames, windowed, Fourier-transformed, converted to an energy density spectrum, log-transformed and color-mapped to obtain the corresponding spectrogram. The spectrogram is then processed separately by a DenseNet module, an LSTM unit and an ED module; the outputs of the DenseNet module and the LSTM unit are fused into space-time fusion information, and the ED module produces enhancement information. The space-time fusion information and the enhancement information are then fused into space-time enhancement information, which the global attention mechanism weights so as to increase the contribution of key-frame speech to the recognition result. Finally, speaker classification is realized under the joint supervision of the Softmax loss function and the Center Loss function.
(2) The spectrogram serves as the input form of the voice signal, preserving both the spatial feature information and the time-varying timing feature information it contains and ensuring the richness of the input. The DenseNet module, the LSTM unit and the ED module each process the information from the spectrogram: feature reuse and convolutional accumulation fully reflect the original sequential characteristics of the speech, while the LSTM unit controls the flow of information with a gating mechanism and memorizes the sequential relations between successive units. Feeding one copy of the information to each of the three modules in parallel lets each module mine feature information with a different emphasis, achieving multi-dimensional feature mining and the extraction and combination of features of different dimensions from the voice signal, which improves recognition accuracy.
(3) An attention mechanism is introduced and the loss function is improved. The local spatial features and temporal features extracted by the two networks are fused with the enhancement information; attention is distributed over the fused information as probability-distribution weights so that it focuses on effective information; and training under the joint supervision of the Softmax Loss function and the Center Loss function enlarges intra-class compactness and inter-class separability, effectively improving recognition accuracy.
Drawings
Fig. 1 is a diagram of a spectrogram generation process.
Fig. 2 is a block diagram of the DenseNet module.
Fig. 3 is a Block diagram of the Dense Block module.
Fig. 4 is a structural diagram of the LSTM cell.
FIG. 5 is a flow chart of a method for voiceprint recognition based on DenseNet-LSTM-ED by a global attention mechanism.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the embodiments of the present disclosure will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the disclosed embodiments and are not intended to limit the disclosed embodiments.
The structures, proportions and sizes shown in the drawings are provided only to illustrate the present disclosure and are not intended to limit the scope of the application; any modification of structure, change of proportion or adjustment of size that does not affect the effect or the objective achieved shall still fall within the spirit and scope of the application. Likewise, terms such as "and" and "or" recited in this specification are used for convenience of description only and do not limit the practicable scope of the application; changes or adjustments of their relative relationships without substantial change to the technical content are also considered within the practicable scope of the application. In addition, the embodiments of the present application are not independent of one another and may be combined.
A spectrogram is a two-dimensional image formed by integrating the frequency-domain characteristics of the speech signal over the time domain, dynamically displaying the relationship between the speech spectrum and time. The spectrogram contains spatial feature information formed by the corresponding time-frequency and energy intensity, together with timing feature information that changes over time. Different textures form according to color depth, these textures carry a large amount of the speaker's personal feature information, and different speakers can be identified from differences in spectrogram texture.
In the spectrogram generation process shown in fig. 1, the voice signal is divided into frames according to its short-time stationarity; each frame is windowed and Fourier-transformed to obtain the amplitude-frequency characteristics of the signal, the energy spectral density is then calculated, and the spectrogram corresponding to the voice signal is obtained by log-transforming and color-mapping the energy spectral density.
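The generation process of fig. 1 (framing, windowing, Fourier transform, energy density, log transform) can be sketched as follows; the frame length, hop size and choice of a Hamming window are illustrative assumptions, and the final color mapping is left to a plotting library:

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Frame -> window -> FFT -> energy spectral density -> log, as in fig. 1.

    frame_len, hop and the Hamming window are illustrative choices; the
    color-mapping step of fig. 1 is left to the plotting library.
    """
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectrum = np.fft.rfft(np.array(frames), axis=1)
    energy = np.abs(spectrum) ** 2                 # energy spectral density
    return 10 * np.log10(energy + 1e-10)           # log transform (dB)

sr = 16000
t = np.arange(sr) / sr
S = log_spectrogram(np.sin(2 * np.pi * 440 * t))   # one second of a 440 Hz tone
```

Each row of S is one frame's log energy spectrum; stacking the rows over time gives the texture image that the DenseNet, LSTM and ED branches consume.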
Following image processing methods based on deep learning, a spectrogram of the speaker is generated as shown in fig. 1 and input as raw data into the DenseNet module for spatial information extraction. As shown in fig. 2, the DenseNet module comprises one initial convolution, N dense connection modules (Dense Blocks) and multiple transition layers (Transitions). A DenseNet module of L layers contains L(L+1)/2 connections; each connection outputs one layer of feature maps, giving L feature maps of different layers, and the feature maps of different layers contain different spatial feature information.
As shown in fig. 3, the Dense Block contains the feature maps x_0, x_1, ..., x_{l-1}, x_l of layer 0, layer 1, ..., layer l-1 and layer l. The feature map of layer l is obtained by splicing the feature maps of all preceding layers, applying the nonlinear transformation H_l to obtain the spliced feature information H_l([x_0, x_1, ..., x_{l-1}]), and mapping that through the activation function γ(x):

x_l = γ(H_l([x_0, x_1, ..., x_{l-1}]))

where γ(x) denotes the activation function and λ_1, λ_2 are non-integer multiplier factors.
The nonlinear transformation H(x) consists of a 1×1 convolution and a 3×3 convolution. The 1×1 convolution, called the bottleneck layer, has 4·K1 output channels, where K1 is a hyperparameter called the growth rate; its function is to fuse the features of each channel and, through dimensionality reduction, decrease the number of feature maps input to the 3×3 convolution, reducing the computation of the network.
To prevent the feature dimension from growing too quickly as the number of network layers increases, the Dense Blocks are connected through transition layers. If the number of input channels of a transition layer is K2, a 1×1 convolution compresses K2, so the transition layer performs model compression; a 2×2 pooling operation then reduces the size of the feature maps, reducing the number of network parameters.
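The channel bookkeeping implied by the growth rate K1 and the transition-layer compression can be checked with a tiny helper; the compression factor theta = 0.5 is a common DenseNet choice, not a value stated in the patent:

```python
def densenet_channels(k0, growth, layers, theta=0.5):
    """Channel counts through one Dense Block followed by a transition layer.

    Each layer emits `growth` (K1) feature maps that are concatenated onto all
    previous ones; the transition's 1x1 convolution then compresses the channel
    count by theta (0.5 is a common choice, an assumption here).
    Returns (channels after the block, channels after the transition).
    """
    channels = k0 + layers * growth       # concatenation grows channels linearly
    return channels, int(theta * channels)

def num_connections(L):
    """An L-layer DenseNet module contains L(L+1)/2 connections."""
    return L * (L + 1) // 2

before, after = densenet_channels(k0=64, growth=32, layers=6)
```

The linear growth per layer is exactly why the transition layers are needed: without compression, later blocks would receive an ever-larger concatenated input.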
When DenseNet is applied to voiceprint recognition to process a spectrogram, the spectrogram can be regarded as a texture image whose inter-pixel spatial-geometric and timing characteristics reflect the speaker's vocal personality. Feature reuse and the features accumulated in the convolution-pooling layers allow DenseNet to fully reflect the original sequential characteristics of the speech. However, as convolution-pooling layers are added, the data volume of the network increases, and when the transmission span between DenseNet's bottleneck and transition layers is long, long-term dependency problems arise, causing gradient explosion or vanishing.
The LSTM unit, also called an LSTM network module or LSTM network, uses the gating concept to control the flow of information through a gating mechanism, alleviating the vanishing-gradient problem. An LSTM unit generally comprises an input gate, an output gate and a forget gate: the output gate determines the unit's output information and the forget gate decides which information to forget. The structure of the LSTM unit is shown in fig. 4; since LSTM networks belong to the prior art, they are not described in detail here.
As shown in fig. 5, the spectrogram of fig. 1 is input as raw data to the ED module, which performs deconvolution processing of the spectrogram information, trend information processing of the spectrogram information, fusion of the deconvolved information with the trend-processed information, and convolution of the fused result to generate the enhancement information. Deconvolution examines the spectrogram information through a transposed-convolution matrix to obtain the high-dimensional information and principal features behind it, realizing feature learning at different scales; trend information processing learns the spectrogram information and analyses the trend information contained before and after the content. In some embodiments, the trend information contained in the spectrogram is calculated by mean square error. The convolution performs comprehensive feature extraction on the deconvolved and trend-processed information and generates the enhancement information; it avoids both the noise introduced when the spectrogram information is processed by deconvolution alone and the information loss when it is processed by trend information alone.
Because information at different moments carries different weight for the state at the current moment of the input voice, information from moments too far in the past has little influence on the present, and the model needs to evaluate the importance of the outputs generated at different moments, a global attention mechanism is introduced. The attention mechanism is an automatic weighting mechanism: it links different modules in weighted form, forcing the model to learn to concentrate on specific parts of the input sequence, i.e. to allocate more attention to the key parts, with specific areas receiving larger weights through the calculation of an attention probability distribution.
Therefore, the application provides a voiceprint recognition method based on a global attention mechanism, which comprises the following steps:
s100: obtaining a spectrogram corresponding to the voice signal by dividing and windowing the voice signal, applying a Fourier transform, computing the energy density spectrum, applying a logarithmic transform, and color mapping;
s200: taking the spectrogram obtained in the step S100 as input, inputting the input into a DenseNet module for spatial feature extraction, and obtaining the spatial sequence information of the voice signal;
s300: the information of the spectrogram obtained in step S100 is copied and sent to LSTM units; after passing through t LSTM units, the time-sequence information of the voice signal is fully extracted;
s400: and (3) copying the information of the spectrogram obtained in the step (S100) and sending the information to an ED module, wherein the ED module comprises deconvolution processing of the information of the spectrogram, trend information processing of the information of the spectrogram, fusion of the deconvolution processed information and the trend information processed information, and convolution processing of the fused deconvolution processed information and the trend processed information to generate enhanced information.
Specifically, step S400 includes:
step S401: deconvolution processing is carried out on the information of the spectrogram;
Ot=s1*(a1-1)+k1-2*p1
wherein a1 is a matrix of spectrogram pixel points; s1 is the length of each shift of the convolution kernel; k1 is the size of the convolution kernel, and when the size of the convolution kernel is not matched with the size of the spectrogram matrix a1, p1 is the first filling matrix; when the size of the convolution kernel is matched with the size of the spectrogram matrix a1, p1 is 0; ot is the information matrix after deconvolution;
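The output-size relation for the deconvolution step above can be checked with a short helper. This is a sketch: the variable names mirror the patent's symbols, and the sizes are treated as scalars for a square input with scalar padding:

```python
def deconv_out_size(a1, s1, k1, p1):
    # Transposed-convolution output size: Ot = s1*(a1-1) + k1 - 2*p1
    return s1 * (a1 - 1) + k1 - 2 * p1

# e.g. a 64x64 spectrogram patch, stride 2, 4x4 kernel, padding 1
print(deconv_out_size(64, 2, 4, 1))  # 128
```

This is the standard size formula for transposed convolution, which upsamples the input (here doubling 64 to 128).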
step S402: trend information processing is carried out on the information of the spectrogram;
in the matrix of the spectrogram pixel points, the trend information of the pixel point positions is obtained through numerical calculation of k periods near the position coordinates (i, g) of each pixel point, and the calculation formula is as follows:
x̄(i,g) = (1 / (2k+1)) * Σ_{j=-k..k} x(i,g+j)

wherein x̄(i,g) is the trend information of pixel point position coordinates (i, g), x(i,g) is the original information of pixel point position coordinates (i, g), x(i,g+j) is the original information of pixel point position coordinates (i, g+j), j ∈ (-k, k), and k is a positive integer;
calculating the position of each pixel point of the spectrogram through the calculation formula to obtain an information matrix Dt after trend information processing of the spectrogram, wherein the formula of the Dt is as follows:
Dt = [ x̄(i,g) ], i = 1..n, g = 1..m

wherein n is the number of rows and m the number of columns of the spectrogram matrix, n and m being positive integers;
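The trend computation described above (a moving average over the k periods around each pixel along the time axis) can be sketched as follows. The edge handling by clipping the window is an assumption, since the patent does not specify boundary behaviour:

```python
import numpy as np

def trend_matrix(x, k=2):
    """Row-wise moving average over a (2k+1)-sample window, producing
    one trend value per pixel; windows are clipped at the matrix edges."""
    n, m = x.shape
    Dt = np.empty_like(x, dtype=float)
    for i in range(n):
        for g in range(m):
            lo, hi = max(0, g - k), min(m, g + k + 1)
            Dt[i, g] = x[i, lo:hi].mean()
    return Dt

x = np.arange(12, dtype=float).reshape(3, 4)
print(trend_matrix(x, k=1)[0])  # first row, smoothed
```

Each output pixel summarises the local trend of its neighbourhood, which is what the fusion step later combines with the deconvoluted features.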
step S403: fusing the information after deconvolution processing and the information after trend information processing;
the information matrix obtained after deconvolution processing is Ot, the information matrix obtained by trend information processing is Dt, and the information obtained after deconvolution processing and the information obtained after trend information processing are fused to form an information fusion matrix OD:
OD = β*Ot + r*Dt

wherein β is the balance parameter of the information matrix Ot obtained after deconvolution processing, and r is a trend scaling factor for controlling the size of the information matrix Dt obtained by the trend information processing.
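A minimal sketch of the fusion step, assuming the weighted-sum form suggested by the description of β as a balance parameter for Ot and r as a scaling factor for Dt; the numeric values of beta and r are illustrative only:

```python
import numpy as np

def fuse(Ot, Dt, beta=0.7, r=0.3):
    """Assumed additive fusion OD = beta*Ot + r*Dt. The additive form
    and the values of beta and r are illustrative assumptions; the
    patent only states that beta balances Ot and r scales Dt."""
    return beta * Ot + r * Dt

OD = fuse(np.full((2, 2), 2.0), np.ones((2, 2)))
print(OD)
```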
Step S404: the information matrix OD of the information after the fused deconvolution processing and the information after the trend processing are subjected to convolution processing to generate enhanced information, the information fusion matrix OD is used as input of the convolution processing, the information fusion matrix OD is subjected to feature extraction, and the calculation formula is as follows:
ED = (OD - k2 + 2*p2) / s2 + 1

where ED is the enhancement information, k2 is the size of the convolution kernel, s2 is the stride of the convolution kernel, and when the size of the convolution kernel does not match the size of the information fusion matrix OD, p2 is the second filling matrix; when the size of the convolution kernel matches the size of the information fusion matrix OD, p2 is 0.
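The convolution output-size relation can likewise be checked numerically; as before, this sketch treats the sizes as scalars for a square input:

```python
def conv_out_size(od, k2, s2, p2):
    # Standard convolution output size: ED = (od - k2 + 2*p2) // s2 + 1
    return (od - k2 + 2 * p2) // s2 + 1

print(conv_out_size(128, 3, 1, 1))  # 128: a 3x3 kernel with padding 1 preserves size
```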
S500: the processing results of the DenseNet module and the LSTM unit are spliced to form space-time fusion information; the space-time fusion information is fused with the enhancement information ED to form space-time enhancement information; the attention mechanism gives the space-time enhancement information different weights; a total Loss function is formed by combining the Softmax Loss function and the Center Loss function; and the category of the voiceprint is identified using the total Loss function.
Specifically, step S500 includes the following specific steps:
s501: splicing the processing results of the DenseNet module and the LSTM unit to form space-time fusion information, and carrying out information fusion on the space-time fusion information and the enhancement information to form space-time enhancement information;
s502: the weight given by the attention mechanism is used for the space-time enhancement information, the category of the voiceprint is identified by utilizing key frame voice, and the category of the voiceprint is predicted;
s503: using the Softmax Loss function in combination with the Center Loss function to form a total Loss function, the total Loss function calculating a difference between a true value of the class of voiceprints and a predicted value of the predicted voiceprint class to obtain a Loss value;
s504: judging whether the loss value is equal to a preset value, if so, completing the recognition of the voiceprint class; if not, the process advances to step S505.
S505: giving new weight to the space-time enhancement information by using an attention mechanism, and identifying the category of the voiceprint again by utilizing key frame voice to predict the category of the voiceprint again;
s506: combining the Softmax Loss function with the Center Loss function again to form a total Loss function, and calculating the difference between the true value of the category of the voiceprint and the predicted value of the category of the voiceprint predicted again by the total Loss function to obtain a new Loss value;
s507: judging whether the new loss value is equal to a preset value, if so, completing the recognition of the voiceprint class; if not, the process goes to step S505 to step S507 again until the loss value is equal to the preset value, and the classification of the voiceprint is completed.
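The S501-S507 loop can be summarised as the following sketch, where `predict`, `total_loss` and `reweight` are hypothetical placeholders for the attention-weighted prediction, the joint loss, and the re-weighting step; the comparison with the preset value is treated as a threshold test, since exact equality of a floating-point loss is rarely achievable in practice:

```python
def train_until_converged(predict, total_loss, reweight, target, max_iter=100):
    """Sketch of steps S501-S507: predict, compute the total loss,
    and re-assign attention weights until the loss reaches `target`."""
    weights = None
    for _ in range(max_iter):
        pred = predict(weights)          # S502/S505: attention-weighted prediction
        loss = total_loss(pred)          # S503/S506: joint Softmax + Center loss
        if loss <= target:               # S504/S507: compare with preset value
            return loss
        weights = reweight(weights)      # S505: assign new attention weights
    return loss

# toy usage: the "loss" halves every time the weights are re-assigned
predict = lambda w: 1.0 if w is None else w
total_loss = lambda pred: pred
reweight = lambda w: (1.0 if w is None else w) / 2
print(train_until_converged(predict, total_loss, reweight, target=0.1))
```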
In some embodiments, the spatio-temporal fusion information is a spatio-temporal fusion information matrix and the enhancement information is an enhancement information matrix. The two are fused to form spatio-temporal enhancement information: specifically, corresponding elements of the spatio-temporal fusion information matrix and the enhancement information matrix ED are added to obtain spatio-temporal enhancement information M. The spatio-temporal enhancement information M is composed of a plurality of sub-spatio-temporal enhancement information m1, m2, ..., mt, ..., ms, where t and s are positive integers. The application provides a global attention mechanism that gives the spatio-temporal enhancement information M different weights; the specific process is as follows:
computing the similarity αti between sub-spatio-temporal enhancement information mt and sub-spatio-temporal enhancement information mi; αti is calculated as follows:

αti = exp(score(mt, mi)) / Σ_{i'=1..s} exp(score(mt, mi'))
its corresponding scoring function is as follows:

score(mt, mi) = mt^T · mi
then the vector Ct is obtained by means of weighted average:

Ct = Σ_{i=1..s} αti · mi
The sub-spatio-temporal enhancement information mt and the vector Ct are spliced head to tail and multiplied by the weight matrix Wc of the attention mechanism to calculate the space-time enhancement information m̃t based on the global attention mechanism:

m̃t = tanh(Wc · [Ct; mt])

wherein tanh is the hyperbolic tangent function.
Finally, the attention weight values are normalized using the softmax classification layer:

p = softmax(Wt · m̃t)

where p is the probability; normalization limits the values to the range [0, 1], and numbers in [0, 1] can represent probabilities; Wt is the feature weight corresponding to m̃t.
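The steps above follow the shape of Luong-style global attention. A numpy sketch follows, using the dot product as the scoring function (one common choice, assumed here) and random matrices standing in for the learned weights Wc and Wt:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_attention(M, Wc, Wt, t):
    """Global attention over sub-information m_1..m_s (rows of M).
    The dot-product score and the shapes of Wc and Wt are assumptions."""
    scores = M @ M[t]                                  # score(m_t, m_i)
    alpha = softmax(scores)                            # alpha_ti
    C_t = alpha @ M                                    # weighted-average context
    m_tilde = np.tanh(Wc @ np.concatenate([C_t, M[t]]))  # tanh(Wc [C_t; m_t])
    return softmax(Wt @ m_tilde)                       # normalized weights p

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 8))                            # 5 sub-vectors of dimension 8
p = global_attention(M, rng.normal(size=(8, 16)), rng.normal(size=(3, 8)), t=2)
print(p)
```

The final softmax guarantees the output weights are in [0, 1] and sum to one, as the normalization paragraph above requires.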
In some embodiments, to improve the characterization capability of the voiceprint features and give them good intra-class compactness and inter-class separability, the application trains the model under the joint supervision of the Softmax Loss function and the Center Loss function; the specific expression of the total loss function L is as follows:
L = sqrt(Ls + λ*Lc)

wherein L represents the total loss function, Ls represents the Softmax loss function, Lc represents the Center Loss function, and λ is a factor for balancing the two loss functions, i.e. their relative proportion; λ ranges from 0 to 1, and taking the square root of the total loss function helps reduce the error value of a miscalculated loss. The Softmax loss function is expressed as follows:

Ls = -(1/m) * Σ_{i=1..m} log( exp(W_{yi}^T·xi + b_{yi}) / Σ_{j=1..c} exp(W_j^T·xi + b_j) )
wherein xi represents the ith feature and yi is the true category label of xi; W_{yi} and W_j respectively represent the weight vectors by which xi is classified into the yi-th class and the j-th class, i.e. the yi-th and j-th columns of the weight W in the last fully connected layer; b_{yi} and b_j respectively represent the bias terms of the yi-th class and the j-th class; c is the number of classes; m is the size of the mini-batch;
the Center Loss function is as follows:

Lc = (1/2) * Σ_{i=1..m} ||xi - C_{yi}||^2

wherein C_{yi} represents the class center of the yi-th class of features. It can be seen that the center loss function provides a class center for each class, so that each sample participating in training is drawn toward the center of its own class, achieving a clustering effect.
Networks trained under the supervision of the Softmax loss function alone can separate different categories, but the compactness of features within a class is not considered; the center loss function minimizes intra-class distances but does not consider inter-class separability. Therefore, the algorithm is optimized by combining the two, which improves intra-class compactness and inter-class separability and realizes high-precision voiceprint recognition.
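A sketch of the joint supervision follows, assuming the common additive form Ls + λ·Lc (without the square root mentioned in the description) for numerical simplicity; the class centers are passed in as a fixed matrix rather than learned:

```python
import numpy as np

def softmax_loss(logits, labels):
    # Cross-entropy over softmax probabilities (L_s)
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def center_loss(feats, labels, centers):
    # Mean squared distance of each sample to its class center (L_c)
    return 0.5 * ((feats - centers[labels]) ** 2).sum(axis=1).mean()

def joint_loss(logits, feats, labels, centers, lam=0.01):
    # Additive combination L = L_s + lambda * L_c, lam in (0, 1)
    return softmax_loss(logits, labels) + lam * center_loss(feats, labels, centers)

# toy usage: two samples, two classes, features already at their centers
logits = np.array([[2.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 1])
feats = np.zeros((2, 3))
centers = np.zeros((2, 3))
print(joint_loss(logits, feats, labels, centers))
```

With features exactly at their class centers the center term vanishes and the joint loss reduces to the cross-entropy term alone, illustrating how λ only matters once features drift from their centers.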
Experimental data set: the voice data set used in the experiment comes from the AISHELL open-source Mandarin Chinese speech database (AISHELL-ASR0009-OS1). 400 speakers from different accent regions of China participated in the recording, which was carried out in a quiet indoor environment. Each speaker recorded more than 300 voice fragments, and the voices of the same speaker were placed under one folder. For the experiment, 10 voices were randomly extracted per speaker and each voice was cut into fragments of 1.5 seconds' duration; the training set comprises 41909 voice spectrograms and the test set comprises 10472 voice spectrograms. Experimental environment: hardware platform: GPU: NVIDIA GTX 1080; RAM: 32G; video memory: 16G; operating system: Windows. The experiments are based on the PyTorch framework. The specific experimental procedure of the voiceprint recognition method based on the global attention mechanism and adopting DenseNet-LSTM-ED is as follows:
Step S1: the speech signal is divided, windowed, Fourier transformed, converted to an energy density spectrum, logarithmically transformed, and color mapped to obtain the corresponding spectrogram of the speech signal. The spectrogram contains the energy information of the voice data at different times and frequencies; its time-sequence feature sequence represents different voice contents, and its personality feature information identifies different speakers.
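Step S1 can be sketched in numpy as follows. The frame length, hop size and the Hamming window are illustrative choices not fixed by the patent; color mapping is omitted since it only affects visualisation:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Framing, Hamming windowing, Fourier transform, energy density
    spectrum, and log transform, as in step S1."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])       # divide + window
    spectrum = np.fft.rfft(frames, axis=1)              # Fourier transform
    energy = np.abs(spectrum) ** 2                      # energy density spectrum
    return np.log(energy + eps)                         # logarithmic transform

# toy usage: 1 second of a 440 Hz tone sampled at 8 kHz
t = np.linspace(0, 1, 8000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (number of frames, frame_len // 2 + 1)
```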
Step S2: the spectrogram obtained in step S1 is taken as input to the DenseNet module for spatial feature extraction. The DenseNet module uses a dense connection mechanism: each layer receives the outputs of all preceding layers as its input. The dense connection mechanism facilitates gradient back-propagation during training, allows a deeper feature-extraction network to be trained, and reuses features through densely connected dimension information, with fewer parameters than a residual (ResNet) network. A t-layer DenseNet module contains t(t+1)/2 connections, yielding t feature maps of different layers that contain rich spatial feature information.
Step S3: the features of the spectrogram obtained in step S1 are copied and used as input to the LSTM unit. The forget gate determines which time feature information of the previous state should be retained; the input gate determines which part of the current input is important; the output gate determines the hidden state information of the next time step. After t LSTM units, the final output containing the temporal feature relations is obtained, together with t hidden states.
Step S4: the features of the spectrogram obtained in step S1 are copied and used as input to the ED module. The ED module performs deconvolution processing on the spectrogram information, performs trend information processing on the spectrogram, fuses the two results, and convolves the fused information to generate the enhancement information ED.
Step S5: after step S2, t feature maps of different layers containing spatial information are obtained; after step S3, t hidden states are obtained through the t LSTM units, each containing important temporal feature information. The information of the t hidden states and the t feature maps is spliced to obtain space-time information, and the enhancement information ED of the ED module is added to obtain space-time enhancement information. The global attention mechanism then processes the space-time enhancement information, focusing attention on the effective information through probability-distribution weights and outputting the global attention weight distribution p of the space-time enhancement information. A total Loss function is formed by combining the Softmax Loss function and the Center Loss function, and the speaker's voiceprint is classified and recognized using the total Loss function, omitting the steps of repeated recognition. Training and classification proceed under the joint supervision of the Softmax Loss function and the Center Loss function: the Softmax function alone can separate different classes but does not consider the compactness of features within a class, while the center loss minimizes the intra-class distance but does not consider inter-class separability; combining the two improves intra-class compactness and inter-class separability, realizing high-precision voiceprint recognition.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
Those of skill in the art will appreciate that the various operations, methods, steps in the flow, acts, schemes, and alternatives discussed in the present application may be alternated, altered, combined, or eliminated. Further, other steps, means, or steps in a process having various operations, methods, or procedures discussed herein may be alternated, altered, rearranged, disassembled, combined, or eliminated. Further, steps, measures, schemes in the prior art with various operations, methods, flows disclosed in the present application may also be alternated, altered, rearranged, decomposed, combined, or deleted.
The above examples merely represent a few implementations of the disclosed embodiments, which are described in more detail and are not to be construed as limiting the scope of the disclosed embodiments. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made to the disclosed embodiments without departing from the spirit of the disclosed embodiments. Accordingly, the scope of the disclosed embodiments should be determined from the following claims.
Claims (9)
1. A voiceprint recognition method based on a global attention mechanism and adopting DenseNet-LSTM-ED, comprising the following steps:
s100: obtaining a spectrogram corresponding to the voice signal through dividing, windowing, fourier transforming, energy density spectrum, logarithmic transforming and color mapping the voice signal;
s200: taking the spectrogram obtained in the step S100 as input, inputting the input into a DenseNet module for spatial feature extraction, and obtaining the spatial information of the voice signal;
s300: the information of the spectrogram obtained in step S100 is copied and sent to LSTM units; after passing through t LSTM units, the time-sequence information of the voice signal is fully extracted;
s400: the information of the spectrogram obtained in the step S100 is copied and sent to an ED module, the ED module comprises deconvolution processing of the information of the spectrogram, trend information processing of the information of the spectrogram, fusion of the deconvolution processed information and the trend information processed information, and convolution processing of the fused deconvolution processed information and the trend processed information to generate enhancement information;
s500: and splicing processing results of the DenseNet module and the LSTM unit to form space-time fusion information, carrying out information fusion on the space-time fusion information and enhancement information ED to form space-time enhancement information, giving different weights to the space-time enhancement information by using an attention mechanism, forming a total Loss function by combining a Softmax Loss function and a Center Loss function, and identifying the category of the voiceprint by using the total Loss function.
2. The method of claim 1, wherein step S200 comprises: the DenseNet module comprises 1 initial convolution, N dense connection modules (Dense Blocks) and a plurality of transition layers; a Dense Block comprises feature maps x_0, x_1, ..., x_{l-1}, x_l, where x_0, x_1, ..., x_{l-1} are the feature maps of layer 0, layer 1, ..., layer l-1; the input of layer l is the splice of the feature maps of all preceding layers, which passes through the nonlinear transformation H_l to give the spliced feature information H_l([x_0, x_1, ..., x_{l-1}]), and the feature map x_l of the l-th layer is obtained through feature mapping by an activation function γ(x); x_l is calculated as follows:

x_l = γ(H_l([x_0, x_1, ..., x_{l-1}]))

wherein γ(x) represents the activation function, and λ1 and λ2 are multiplier factors that are not integers.
3. The method of claim 1, the step 400 comprising: step S401: deconvolution processing is carried out on the information of the spectrogram;
Ot=s1*(a1-1)+k1-2*p1
wherein a1 is a matrix of spectrogram pixel points; s1 is the length of each shift of the convolution kernel; k1 is the size of the convolution kernel, and when the size of the convolution kernel is not matched with the size of the spectrogram matrix a1, p1 is the first filling matrix; when the size of the convolution kernel is matched with the size of the spectrogram matrix a1, p1 is 0; ot is the information matrix after deconvolution;
step S402: trend information processing is carried out on the information of the spectrogram;
in the matrix of the spectrogram pixel points, the trend information of the pixel point positions is obtained through numerical calculation of k periods near the position coordinates (i, g) of each pixel point, and the calculation formula is as follows:
x̄(i,g) = (1 / (2k+1)) * Σ_{j=-k..k} x(i,g+j)

wherein x̄(i,g) is the trend information of pixel point position coordinates (i, g), x(i,g) is the original information of pixel point position coordinates (i, g), x(i,g+j) is the original information of pixel point position coordinates (i, g+j), j ∈ (-k, k), and k is a positive integer;
calculating the position of each pixel point of the spectrogram through the calculation formula to obtain an information matrix Dt after trend information processing of the spectrogram, wherein the formula of the Dt is as follows:
Dt = [ x̄(i,g) ], i = 1..n, g = 1..m

wherein n is the number of rows and m the number of columns of the spectrogram matrix, n and m being positive integers;
step S403: fusing the information after deconvolution processing and the information after trend information processing;
the information matrix obtained after deconvolution processing is Ot, the information matrix obtained by trend information processing is Dt, and the information obtained after deconvolution processing and the information obtained after trend information processing are fused to form an information fusion matrix OD:
wherein ,for the equilibrium parameter of the information matrix obtained after deconvolution processing being Ot, r is a trend scaling factor, and is used for controlling the size of the information matrix Dt obtained by trend information processing;
step S404: the information matrix OD of the information after the fused deconvolution processing and the information after the trend processing are subjected to convolution processing to generate enhanced information, the information fusion matrix OD is used as input of the convolution processing, the information fusion matrix OD is subjected to feature extraction, and the calculation formula is as follows:
ED = (OD - k2 + 2*p2) / s2 + 1

where ED is the enhancement information, k2 is the size of the convolution kernel, s2 is the stride of the convolution kernel, and when the size of the convolution kernel does not match the size of the information fusion matrix OD, p2 is the second filling matrix; when the size of the convolution kernel matches the size of the information fusion matrix OD, p2 is 0.
4. A method according to claim 3, said step 500 comprising:
s501: splicing the processing results of the DenseNet module and the LSTM unit to form space-time fusion information, and carrying out information fusion on the space-time fusion information and the enhancement information to form space-time enhancement information;
s502: the weight given by the attention mechanism is used for the space-time enhancement information, the category of the voiceprint is identified by utilizing key frame voice, and the category of the voiceprint is predicted;
s503: using the Softmax Loss function in combination with the Center Loss function to form a total Loss function, the total Loss function calculating a difference between a true value of the class of voiceprints and a predicted value of the predicted voiceprint class to obtain a Loss value;
s504: judging whether the loss value is equal to a preset value, if so, completing the recognition of the voiceprint class; if not, the process advances to step S505.
S505: giving new weight to the space-time enhancement information by using an attention mechanism, and identifying the category of the voiceprint again by utilizing key frame voice to predict the category of the voiceprint again;
s506: combining the Softmax Loss function with the Center Loss function again to form a total Loss function, and calculating the difference between the true value of the category of the voiceprint and the predicted value of the category of the voiceprint predicted again by the total Loss function to obtain a new Loss value;
s507: judging whether the new loss value is equal to a preset value, if so, completing the recognition of the voiceprint class; if not, the process goes to step S505 to step S507 again until the loss value is equal to the preset value, and the classification of the voiceprint is completed.
5. The method according to claim 4, wherein the spatio-temporal fusion information and the enhancement information are added to obtain spatio-temporal enhancement information M, the spatio-temporal enhancement information M being composed of a plurality of sub-spatio-temporal enhancement information m1, m2, ..., mt, ..., ms of the same dimension, t being a positive integer and s being a positive integer; the global attention mechanism gives the spatio-temporal enhancement information M different weights through the following steps:
computing the similarity αti between sub-spatio-temporal enhancement information mt and sub-spatio-temporal enhancement information mi; αti is calculated as follows:

αti = exp(score(mt, mi)) / Σ_{i'=1..s} exp(score(mt, mi'))
its corresponding scoring function is as follows:

score(mt, mi) = mt^T · mi
then the vector Ct is obtained by means of weighted average:

Ct = Σ_{i=1..s} αti · mi
6. The method of claim 5, wherein the sub-spatio-temporal enhancement information mt and the vector Ct are spliced head to tail and multiplied by the weight matrix Wc of the attention mechanism to calculate the space-time enhancement information m̃t based on the global attention mechanism:

m̃t = tanh(Wc · [Ct; mt])

wherein tanh is the hyperbolic tangent function;
finally, the attention weight values are normalized using a softmax classification layer:

p = softmax(Wt · m̃t)

where p is the probability; normalization limits the values to the range [0, 1], and numbers in [0, 1] can represent probabilities; Wt is the feature weight corresponding to m̃t.
7. The method of claim 1, wherein in step S500 the total loss function is as follows:

L = sqrt(Ls + λ*Lc)

wherein L represents the total loss function, Ls represents the Softmax loss function, Lc represents the Center Loss function, and λ is a factor used to balance the two loss functions.
8. The method of claim 7, wherein the Softmax loss function is as follows:

Ls = -(1/m) * Σ_{i=1..m} log( exp(W_{yi}^T·xi + b_{yi}) / Σ_{j=1..c} exp(W_j^T·xi + b_j) )

wherein xi represents the ith feature and yi is the true category label of xi; W_{yi} and W_j respectively represent the weight vectors by which xi is classified into the yi-th class and the j-th class; b_{yi} and b_j respectively represent the bias terms of the yi-th class and the j-th class; c is the number of classes; m represents the size of the mini-batch.
9. The method of claim 8, wherein the Center Loss function is as follows:

Lc = (1/2) * Σ_{i=1..m} ||xi - C_{yi}||^2

wherein C_{yi} represents the class center of the yi-th class of features, xi represents the ith feature, and m represents the size of the mini-batch.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310826924.6A CN116863939A (en) | 2023-07-07 | 2023-07-07 | Voiceprint recognition method based on global attention mechanism and adopting DenseNet-LSTM-ED |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116863939A true CN116863939A (en) | 2023-10-10 |
Family
ID=88226306
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310826924.6A Pending CN116863939A (en) | 2023-07-07 | 2023-07-07 | Voiceprint recognition method based on global attention mechanism and adopting DenseNet-LSTM-ED |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116863939A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117292209A (en) * | 2023-11-27 | 2023-12-26 | 之江实验室 | Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization |
CN117292209B (en) * | 2023-11-27 | 2024-04-05 | 之江实验室 | Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization |
CN117598711A (en) * | 2024-01-24 | 2024-02-27 | 中南大学 | QRS complex detection method, device, equipment and medium for electrocardiosignal |
CN117598711B (en) * | 2024-01-24 | 2024-04-26 | 中南大学 | QRS complex detection method, device, equipment and medium for electrocardiosignal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |