CN111145760A - Method and neural network model for speaker recognition - Google Patents
- Publication number
- CN111145760A (application number CN202010256078.5A)
- Authority
- CN
- China
- Prior art keywords
- vector
- attention
- ith
- vectors
- coding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
Embodiments of the present description provide a method and a neural network model for speaker recognition. According to the method, spectral features of an audio segment of a speaker are first obtained; the spectral features are then encoded to obtain a vector sequence composed of N frame-level encoding vectors. Next, K kinds of pooling processing are respectively applied to the vector sequence to obtain K corresponding sub-embedding vectors; any ith pooling process includes determining, for each encoding vector among the N encoding vectors, an attention coefficient based on the ith attention algorithm corresponding to that pooling process, and summing the encoding vectors with their attention coefficients as weighting factors. A total embedding vector is then determined based on the K sub-embedding vectors, and speaker recognition is performed based on the total embedding vector.
Description
Technical Field
One or more embodiments of the present description relate to the field of machine learning, and more particularly, to methods and neural network models for speaker recognition, among other things.
Background
A voiceprint is an acoustic feature extracted from the spectral characteristics of a speaker's sound waves. Like a fingerprint, a voiceprint, as a biometric characteristic, can reflect the individual traits and identity information of a speaker. Voiceprint recognition, also known as speaker recognition, is a biometric authentication technique that uses the speaker-specific information contained in a speech signal to automatically recognize the speaker's identity. This technology has broad application prospects in fields and scenarios such as identity authentication and security verification.
Speaker recognition systems and neural network models have been proposed for authentication and verification, which generally extract feature vectors from speaker audio that express the characteristics of the speaker's voice, and perform speaker recognition based on the feature vectors. However, the recognition accuracy of the existing schemes still needs to be improved.
It is desirable to have an improved scheme that can more effectively acquire feature vectors reflecting the speaking characteristics of a speaker, thereby further improving the accuracy of speaker recognition.
Disclosure of Invention
One or more embodiments of this specification describe a method and a neural network model for speaker recognition in which a global multi-head attention mechanism is used for vector aggregation, so as to better capture the utterance-level speaking characteristics of a speaker and improve the accuracy of speaker recognition.
According to a first aspect, there is provided a method of speaker recognition, comprising:
acquiring spectral features of an audio segment of a speaker;
encoding the spectral features to obtain a vector sequence composed of N frame-level encoding vectors;
respectively applying K kinds of pooling processing to the vector sequence to obtain K corresponding sub-embedding vectors; wherein the ith pooling process among the K pooling processes comprises, for any first encoding vector among the N encoding vectors, determining an attention coefficient of the first encoding vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the encoding vectors using the attention coefficients of the respective encoding vectors as weighting factors; wherein K is an integer greater than 1;
determining a total embedding vector based on the K sub-embedding vectors;
and carrying out speaker identification based on the total embedded vector.
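The claimed steps can be sketched end to end. The following is a minimal plain-Python illustration, not the claimed implementation: `attention_pool` and `multi_head_embedding` are hypothetical helper names, the softmax-style normalization is one of the normalization choices described later in the specification, and the toy encodings and attention vectors stand in for trained parameters.

```python
import math

def attention_pool(encodings, attn_vec):
    """One pooling head: dot-product self-attention scores, softmax-normalized
    into attention coefficients, then a weighted sum of the frame-level vectors."""
    scores = [sum(a * h for a, h in zip(attn_vec, enc)) for enc in encodings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    coeffs = [e / total for e in exps]
    d = len(encodings[0])
    return [sum(c * enc[j] for c, enc in zip(coeffs, encodings)) for j in range(d)]

def multi_head_embedding(encodings, attn_vecs):
    """K pooling heads produce K sub-embeddings, concatenated into the total embedding."""
    total = []
    for v in attn_vecs:
        total.extend(attention_pool(encodings, v))
    return total

# Toy data: N=3 frame-level encodings of dimension d=2, K=2 attention heads.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
E = multi_head_embedding(H, V)   # total embedding of dimension K*d = 4
```

With a zero attention vector every frame receives the same coefficient 1/N, so the pooled vector reduces to the plain average of the encodings — a useful sanity check on the weighting.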
In different embodiments, the spectral features may comprise mel-frequency cepstral coefficient (MFCC) features or mel-scale filter bank (FBank) features.
In one embodiment, a fully connected feedforward neural network is used to encode the spectral features to obtain the N code vectors. In another embodiment, the vector sequence is obtained by performing convolution processing on the spectral features by using a plurality of convolution kernels.
In a specific embodiment, the ith pooling process specifically comprises: obtaining a self-attention score of the first encoding vector from the dot product of the ith attention vector corresponding to the ith pooling process and the first encoding vector; and determining the attention coefficient of the first encoding vector from the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In another embodiment, the ith pooling process specifically comprises: applying a linear transformation and a nonlinear activation function to the first encoding vector in sequence to obtain a first transformed vector; obtaining a self-attention score of the first encoding vector from the dot product of the ith attention vector and the first transformed vector; and determining the attention coefficient of the first encoding vector from the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In one embodiment, the above method is performed using a neural network model, the ith attention vector being determined by training the neural network model.
In a further specific embodiment, the ith pooling process specifically comprises: determining a self-attention score of the first encoding vector according to a self-attention scoring function; dividing the self-attention score by an ith precision coefficient preset for the ith pooling process to obtain an adjusted score; and determining the attention coefficient of the first encoding vector according to the adjusted score, such that the attention coefficient is positively correlated with the adjusted score.
In a specific embodiment, the total embedded vector is obtained by concatenating the K sub-embedded vectors.
According to a second aspect, there is provided a neural network model for speaker recognition, comprising:
the input layer is used for acquiring the frequency spectrum characteristics of the audio segments of the speaker;
the coding layer is used for coding the frequency spectrum characteristics to obtain a vector sequence consisting of N coding vectors at a frame level;
the pooling layer comprises K pooling units and is used for respectively applying K kinds of pooling processing to the vector sequence to obtain K corresponding sub-embedding vectors; any ith pooling unit among the K pooling units is configured to, for any first encoding vector among the N encoding vectors, determine an attention coefficient of the first encoding vector based on an ith attention algorithm corresponding to the ith pooling unit, and sum the encoding vectors using the attention coefficients of the respective encoding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion layer for determining a total embedding vector based on the K sub-embedding vectors;
and the classification layer is used for carrying out speaker identification based on the total embedded vector.
According to a third aspect, there is provided an apparatus for speaker recognition, comprising:
the input module is configured to acquire the frequency spectrum characteristics of the speaker audio fragment;
the coding module is configured to code the spectrum characteristics to obtain a vector sequence formed by N coding vectors at a frame level;
the pooling module comprises K pooling submodules and is configured to respectively apply K kinds of pooling processing to the vector sequence to obtain K corresponding sub-embedding vectors; any ith pooling submodule among the K pooling submodules is configured to, for any first encoding vector among the N encoding vectors, determine an attention coefficient of the first encoding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the encoding vectors using the attention coefficients of the respective encoding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion module configured to determine a total embedded vector based on the K sub-embedded vectors;
and the classification module is configured to perform speaker recognition based on the total embedded vector.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first or second aspect.
According to the method and the neural network model provided by the embodiments of this specification, after the frame-level encoding vectors are obtained by encoding, they are aggregated into an embedding vector using an attention mechanism in a global multi-head attention pooling manner. In a further embodiment, a precision, or resolution, can be set for each attention head, forming a multi-resolution multi-head pooling scheme. The total embedding vector thus obtained reflects the utterance-level characteristics of the speaker more comprehensively, so speaker recognition based on it can achieve higher recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a schematic structural diagram of a neural network model for speaker recognition, according to one embodiment of the present description;
FIG. 3 shows curves of the attention coefficient under different precision coefficients;
FIG. 4 illustrates a flow diagram of a method of speaker recognition, according to one embodiment;
FIG. 5 shows a schematic block diagram of an apparatus for speaker recognition, according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. First, a speaker speaks, producing speaker audio. The speaker audio is input to a spectrum extraction unit, which extracts basic spectral features from it. These spectral features are input into the neural network model. In the embodiment of Fig. 1, the neural network model first produces, based on the input spectral features, a sequence of frame-level encoding vectors, i.e., N encoding vectors corresponding to the frames of the audio. The sequence of frame-level encoding vectors is then aggregated into an embedding vector that expresses speaking characteristics at the utterance (speech segment) level. Finally, the neural network model predicts a speaker identifier (id) based on the embedding vector, thereby performing speaker recognition.
As mentioned above, the feature vector on which speaker recognition is based, i.e. the above embedded vector in the embodiments of the present specification, is crucial to the accuracy of voiceprint recognition. It is expected that the embedded vector can fully reflect the speaking voice characteristics of the speaker, so that higher recognition accuracy can be obtained. In the embodiment of the present specification, an attention mechanism is utilized, and a global multi-head attention pooling manner is adopted to aggregate N frame-level coding vectors into one embedded vector, so that the embedded vector comprehensively reflects the characteristics of the speaker speech segment level. The process of encoding, pooling, and identifying the audio spectrum is described below in conjunction with the structure of the neural network model.
FIG. 2 illustrates a schematic structural diagram of a neural network model for speaker recognition according to one embodiment of the present description. It will be appreciated that the neural network model may be implemented in any suitable computer programming language. As shown in Fig. 2, the neural network model can be implemented as a deep neural network comprising at least an input layer 21, an encoding layer 22, a pooling layer 23, a fusion layer 24, and a classification layer 25. The input layer 21 obtains the spectral features of the speaker's audio segment and passes them to the encoding layer 22; the encoding layer 22 encodes the spectral features to obtain N frame-level encoding vectors h_1, h_2, …, h_N; the pooling layer 23 comprises K pooling units, each of which applies an attention-based pooling process to the N encoding vectors to produce a sub-embedding vector, so that the pooling layer 23 outputs K sub-embedding vectors e_1, e_2, …, e_K; the fusion layer 24 fuses the K sub-embedding vectors to obtain a total embedding vector E. Then, the classification layer 25 performs classification prediction based on the embedding vector E and outputs the speaker recognition result.
The processing of the above layers is described in detail below.
First, the input layer 21 acquires spectral characteristics of a speaker's audio clip. In one embodiment, the speaker audio segment is inputted into a spectrum extraction unit which is added outside the model, and the spectrum extraction unit extracts the basic spectrum feature from the speaker audio and then inputs the extracted basic spectrum feature into the input layer 21 of the neural network model. In another embodiment, the input layer 21 is designed with a spectrum extraction function. In such a case, the speaker audio may be directly input to the input layer, and the spectral feature may be obtained by performing feature extraction on the input layer.
In one embodiment, the spectral features are mel-frequency cepstral coefficient (MFCC) features. The mel scale is derived from the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz. Extracting MFCC features from speaker audio typically includes the following steps: pre-emphasis, framing, windowing, Fourier transform, mel filter bank, and discrete cosine transform (DCT). Pre-emphasis boosts the high-frequency part to a certain extent so that the signal spectrum becomes flatter; framing divides the speech into a series of frames over time; windowing applies a window function to improve continuity at the two ends of each frame. The audio is then Fourier-transformed, converting the time-domain signal into a frequency-domain signal, whose frequencies are mapped to the mel scale using the mel filter bank to obtain the mel spectrum. Finally, cepstral coefficients of the mel spectrum are obtained through the discrete cosine transform, yielding the MFCC features.
In another embodiment, the spectral features are mel-scale filter bank (FBank) features. The FBank feature is the spectral feature obtained by mapping the frequencies of the frequency-domain signal to the mel scale using the mel filter bank. In other words, the MFCC feature is a further discrete cosine transform of the FBank feature; the FBank feature is what the MFCC pipeline produces before the discrete cosine transform.
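The relation between FBank and MFCC features described above can be illustrated with the single step that separates them, the discrete cosine transform. The sketch below assumes a toy 8-filter log mel energy frame; `dct2` is a hypothetical helper implementing an unnormalized type-II DCT, and real extractors additionally perform pre-emphasis, framing, windowing, and mel filtering beforehand.

```python
import math

def dct2(x, num_coeffs):
    """Unnormalized type-II DCT: the transform that turns log mel filter-bank
    (FBank) energies into mel-frequency cepstral coefficients (MFCC)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N)) for n in range(N))
            for k in range(num_coeffs)]

# Toy FBank frame: log energies from an (assumed) 8-filter mel bank.
fbank_frame = [2.0, 2.1, 1.9, 1.5, 1.0, 0.8, 0.7, 0.6]
mfcc_frame = dct2(fbank_frame, num_coeffs=4)  # first 4 cepstral coefficients
```

Note that coefficient 0 of this unnormalized DCT is simply the sum of the filter-bank energies, and a perfectly flat FBank frame produces zero for every higher coefficient — the DCT concentrates the spectral envelope into a few low-order cepstral values.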
In yet another embodiment, the spectral features may include linear predictive coding (LPC) features or perceptual linear prediction (PLP) features, which can be extracted by conventional methods. Other spectral features may also be extracted as the basis for the neural network model's processing, which is not specifically limited here.
Based on the spectral features obtained above, the encoding layer 22 encodes them to obtain a vector sequence H composed of frame-level encoding vectors, which can be written as H = [h_1, h_2, …, h_N], where h_t corresponds to the spectral features of the t-th frame. Since the encoding layer 22 is an intermediate layer of the neural network model, the encoding vectors h_t are also referred to as hidden vectors.
The encoding layer 22 may derive the frame-level encoding vectors in a variety of ways.
In one embodiment, the encoding layer 22 is implemented as a multi-layer perceptron: the spectral features of each frame are processed layer by layer in a fully connected feedforward manner to obtain the encoding vector h_t corresponding to each frame.
In another embodiment, the encoding layer 22 processes the spectral features by convolution. Specifically, in one example, the encoding layer 22 may be embodied as a multi-layer convolutional residual network comprising a plurality of convolutional layers, each with a corresponding convolution kernel used to perform the convolution operation on the spectral features. The kernels of different layers may have the same or different sizes; for example, in one example, the first 2 convolutional layers each employ a 3 x 1 kernel, and the following layers each employ a 3 x 3 kernel. Kernels of different layers have different convolution parameters. Through such multi-layer convolution, the spectral features of each frame are processed into the corresponding encoding vector h_t.
After obtaining the N code vectors in various ways, the code layer 22 then outputs a vector sequence H formed by the N code vectors to the pooling layer 23, and the pooling layer 23 pools and aggregates the N code vectors.
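As a rough illustration of the frame-wise encoding described above, the sketch below applies a single fully connected layer with a tanh activation independently to each frame's spectral feature vector; the weights, bias, and `encode_frames` helper are hypothetical stand-ins for a trained multi-layer encoder.

```python
import math

def encode_frames(spectral_frames, W, b):
    """Toy frame-level encoder: one fully connected layer with tanh,
    applied independently to each frame's spectral feature vector."""
    def encode(x):
        z = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(W))]
        return [math.tanh(zi) for zi in z]
    return [encode(x) for x in spectral_frames]

# Toy input: N=3 frames of 2-dimensional spectral features.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W = [[1.0, -1.0], [0.5, 0.5]]    # illustrative weights
b = [0.0, 0.1]                   # illustrative bias
H = encode_frames(frames, W, b)  # N=3 frame-level encoding vectors
```

Because the same parameters are applied to every frame, the output is a sequence of N encoding vectors of equal dimension, which is exactly the vector sequence H the pooling layer expects.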
As shown in Fig. 2, the pooling layer 23 includes K pooling units. Each pooling unit determines an attention coefficient for each encoding vector based on an attention mechanism and, using these coefficients, aggregates the vector sequence into a corresponding sub-embedding vector; a pooling unit may therefore also be referred to as an attention head. In this manner, the pooling layer 23 obtains multiple sub-embedding vectors using multiple attention heads. The aggregation process of any one pooling unit, the ith, is described below.
For any encoding vector h_t among the N encoding vectors, the ith pooling unit determines its attention coefficient α_t according to the ith attention algorithm corresponding to that unit, and then sums the N encoding vectors using their respective attention coefficients as weighting factors to obtain the corresponding sub-embedding vector. The ith attention algorithm corresponding to the ith pooling unit may include: scoring the encoding vector h_t with a self-attention scoring function based on the corresponding ith attention vector v^{(i)}, obtaining the self-attention score of h_t; and determining the attention coefficient of h_t based on this score, such that the attention coefficient is positively correlated with the self-attention score. The ith attention vector v^{(i)} may be preset as a hyperparameter, but more generally and preferably it is determined by training the neural network model.
In one example, the self-attention scoring function may be expressed as:

s_t^{(i)} = (v^{(i)})^T h_t        (1)

where the superscript (i) indicates correspondence to the ith pooling unit, and (v^{(i)})^T denotes the transpose of v^{(i)}. Formula (1) thus amounts to taking the dot product of the ith attention vector v^{(i)} and the encoding vector h_t as the self-attention score of h_t.
In other examples, more complex transformations may be applied on this basis to obtain the self-attention score. In one example, a linear transformation and a nonlinear activation function may first be applied to the encoding vector h_t in sequence to obtain a transformed vector, and the self-attention score is then obtained from the dot product of the transformed vector and the ith attention vector v^{(i)}. Specifically, in one example, the self-attention scoring function may be expressed as:

s_t^{(i)} = (v^{(i)})^T f(W^{(i)} h_t + b^{(i)}) + k^{(i)}        (2)

where W^{(i)} is the matrix used for the linear transformation, f is a nonlinear activation function such as the sigmoid function, b^{(i)} is an optional bias vector, and k^{(i)} is an optional bias parameter. These parameters are determined by the training process of the model.
Other self-attention scoring functions, not enumerated herein, may also be derived with minor modifications based on the above equations, such as adding coefficients, adding or subtracting bias terms, and the like.
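The two scoring variants above — a plain dot product with the attention vector, and a dot product after a linear transformation plus nonlinearity — can be sketched as follows. The function names, toy weights, and biases are illustrative only, and the sigmoid is used as the nonlinearity since the text names it as one possible choice.

```python
import math

def score_dot(attn_vec, h):
    """Plain self-attention score: dot product of the head's attention vector with h."""
    return sum(a * x for a, x in zip(attn_vec, h))

def score_transformed(attn_vec, W, b, h, k=0.0):
    """Score after a linear transformation and a sigmoid nonlinearity:
    v . f(W h + b) + k, with f the logistic sigmoid."""
    z = [sum(W[i][j] * h[j] for j in range(len(h))) + b[i] for i in range(len(W))]
    f = [1.0 / (1.0 + math.exp(-zi)) for zi in z]
    return sum(a * x for a, x in zip(attn_vec, f)) + k

# Toy values: one 2-dimensional encoding vector and illustrative parameters.
h = [0.5, -0.2]
v = [1.0, 1.0]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
s1 = score_dot(v, h)
s2 = score_transformed(v, W, b, h)
```

Both variants map one encoding vector to one scalar score; the transformed variant simply gives the head extra trainable parameters before the dot product.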
After the self-attention score of the encoding vector h_t has been determined, its attention coefficient can be determined from that score. The attention coefficient serves as the subsequent weighting factor, measuring the importance of the corresponding encoding vector in the aggregation.
According to embodiments of this specification, the attention coefficient α_t of the encoding vector h_t is determined such that it is positively correlated with the self-attention score of h_t. Specifically, in one example, the self-attention score computed above may be used directly as the attention coefficient. In another example, the self-attention scores of the N encoding vectors are normalized, and the normalized ratios are taken as the attention coefficients.
For example, in one example, using scale normalization, the attention coefficient α_t of the encoding vector h_t can be expressed as:

α_t = s_t^{(i)} / Σ_{j=1..N} s_j^{(i)}        (3)
in another example, normalization is performed using a softmax function to encode a vector htAttention coefficient α oftCan be expressed as:
On the basis of the attention coefficient determined for each encoding vector, the encoding vectors can be summed using their attention coefficients as weighting factors, yielding the sub-embedding vector e_i corresponding to the ith pooling unit, namely:

e_i = Σ_{t=1..N} α_t h_t        (5)
it is assumed that each encoded vector is a d-dimensional vector, and the resulting sub-embedded vectors are also d-dimensional vectors.
It can be understood that each pooling unit (i.e. attention head) assigns different attention coefficients to each coded vector in the above manner (only the attention vectors in each pooling unit are different), and then aggregates each coded vector according to the attention coefficients to obtain corresponding sub-embedded vectors, so that K pooling units obtain K sub-embedded vectors respectively. Therefore, the pooling layer 23 uses a multi-head attention method to obtain a plurality of sub-embedding vectors through a plurality of attention-based aggregation methods.
In contrast to some techniques that take a segment of encoding vectors as the analysis target, the multiple attention heads of the embodiments of this specification determine an attention coefficient for every frame-level encoding vector and then perform weighted aggregation over all of them, taking the speech information of all frames into account; this may be called a global multi-head attention approach. In this way, speaker characteristic information at the utterance level can be captured better.
Further, in one embodiment, on the basis of multi-head attention, a different precision coefficient T_i is set for each attention head to adjust the precision, or "resolution", of its attention coefficients. By analogy with the terminology of knowledge distillation and simulated annealing, this coefficient may be referred to as a temperature coefficient. The temperature coefficient T_i scales the self-attention score in inverse proportion and is introduced into the determination of the attention coefficient, thereby adjusting the resolution of the attention coefficient.
Specifically, in one embodiment, the self-attention score is divided by the ith precision coefficient T_i preset for the ith pooling unit to obtain an adjusted score; the attention coefficient of the encoding vector is then determined according to the adjusted score.
When the attention coefficient is determined using formula (4), the attention coefficient after introducing the temperature-coefficient correction can be expressed as:

α_t = exp(s_t^{(i)} / T_i) / Σ_{j=1..N} exp(s_j^{(i)} / T_i)        (6)
Fig. 3 shows graphs of the attention coefficient as a function of the self-attention score under different precision coefficients. In the leftmost plot of fig. 3, the temperature coefficient is T = 1: the attention-coefficient curve is very steep, i.e. the attention coefficient is highly sensitive to changes in the similarity score. When the temperature coefficient rises to 20, the curve flattens, as shown in the middle plot; when it rises to 30, as shown in the rightmost plot, the curve becomes flatter still. The flatter the curve, the less sensitive the attention coefficient is to the self-attention score, i.e. the lower the precision or resolution of the attention coefficient. It can be appreciated that as the temperature coefficient increases, the resolution of the attention coefficient decreases. When the temperature coefficient approaches infinity, the attention coefficient becomes a constant (1/N) that no longer varies with the similarity score, and the resolution of the attention coefficient is 0.
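The flattening effect shown in fig. 3 is easy to verify numerically. A toy sketch, where the scores are arbitrary stand-ins for self-attention scores:

```python
import numpy as np

def attention_coeffs(scores, T):
    """Softmax over self-attention scores divided by temperature T."""
    z = scores / T
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([3.0, 1.0, 0.2, -1.5])   # arbitrary self-attention scores
for T in (1, 20, 30, 1e6):
    alpha = attention_coeffs(scores, T)
    print(T, np.round(alpha, 3))
# As T grows, the distribution flattens toward the uniform value 1/N = 0.25
```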
In one embodiment, different temperature coefficients T may be set in advance for the respective pooling units, thereby forming a multi-resolution multi-attention-head pooling layer 23. Each pooling unit in the pooling layer 23 performs attentive aggregation over the N coded vectors using its corresponding temperature coefficient and attention algorithm to obtain a corresponding sub-embedding vector. The pooling layer 23 thus obtains K sub-embedding vectors through its K pooling units.
Next, the K sub-embedding vectors are input to the fusion layer 24, which fuses them into a total embedding vector E.
Specifically, in one embodiment, the fusion layer 24 concatenates the K sub-embedding vectors to obtain the total embedding vector E. Assuming both the coding vectors and the sub-embedding vectors have dimension d, the resulting total embedding vector has dimension K × d. In other examples, the fusion layer 24 may use other fusion operations to obtain the total embedding vector E; for example, the K sub-embedding vectors may be summed, multiplied element-wise, and so on.
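The dimension bookkeeping can be illustrated with hypothetical values of K and d:

```python
import numpy as np

K, d = 4, 8
# K sub-embedding vectors of dimension d (random stand-ins)
sub_embeddings = [np.random.default_rng(i).normal(size=d) for i in range(K)]

E_concat = np.concatenate(sub_embeddings)   # splicing: dimension K * d
E_sum = np.sum(sub_embeddings, axis=0)      # alternative fusion: dimension d
print(E_concat.shape, E_sum.shape)          # (32,) (8,)
```

Concatenation preserves each head's contribution in its own slots, at the cost of a K-fold larger embedding; summation keeps dimension d but mixes the heads.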
The classification layer 25 may then perform speaker recognition based on the total embedded vector obtained as described above.
In one particular example, the classification layer 25 may include several fully-connected sublayers that further process the total embedding vector, and an output layer that classifies the resulting vector, for example using a softmax function. The classification result may be a speaker id, or a binary result indicating whether the speech belongs to a certain speaker. In this manner, the classification layer 25 outputs the result of speaker recognition.
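A minimal sketch of such a classification head, with one fully-connected sublayer and a softmax output (the sizes and weights below are hypothetical stand-ins for trained parameters):

```python
import numpy as np

def classify(E, W1, b1, W2, b2):
    """E: total embedding vector; returns a probability per speaker id."""
    h = np.maximum(E @ W1 + b1, 0.0)      # fully-connected sublayer + ReLU
    logits = h @ W2 + b2                  # output layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # softmax over speaker ids

rng = np.random.default_rng(3)
D, hidden, n_speakers = 32, 16, 10
E = rng.normal(size=D)
W1, b1 = rng.normal(size=(D, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n_speakers)) * 0.1, np.zeros(n_speakers)
probs = classify(E, W1, b1, W2, b2)
print(probs.shape)                        # (10,); entries sum to 1
```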
Reviewing the structure and processing of the above neural network model, it can be seen that after the coding layer obtains the frame-level coding vectors, the pooling layer uses an attention mechanism, in a global multi-head attention pooling manner, to aggregate the frame-level coding vectors into an embedding vector. In a further embodiment, the precision or resolution of each attention head may also be set, forming a multi-resolution multi-attention-head pooling layer. The total embedding vector obtained in this way more comprehensively reflects the speaker's characteristics at the speech-segment level, and speaker recognition based on it can therefore achieve higher recognition accuracy.
In another aspect, a method of speaker recognition is provided. FIG. 4 illustrates a flow diagram of a method of speaker recognition according to one embodiment. The method may be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities. As shown in fig. 4, the method includes the following steps.
In step 41, spectral features of an audio segment of the speaker are obtained. In different embodiments, the spectral features may comprise mel-frequency cepstral coefficient (MFCC) features or mel-scale filter bank (FBank) features.
Then, in step 42, the spectral features are encoded to obtain a vector sequence of N encoded vectors at the frame level. In one embodiment, a fully connected feedforward neural network is used to encode the spectral features to obtain the N code vectors. In another embodiment, the vector sequence is obtained by performing convolution processing on the spectral features by using a plurality of convolution kernels.
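Either encoding variant maps the spectral feature of each frame to a fixed-dimension coding vector. A minimal sketch of the feedforward variant, with random weights standing in for trained parameters:

```python
import numpy as np

def encode_frames(X, W, b):
    """X: (N, f) spectral features for N frames (e.g. FBank or MFCC);
    returns (N, d) frame-level coding vectors via an affine map + ReLU."""
    return np.maximum(X @ W + b, 0.0)

rng = np.random.default_rng(1)
N, f, d = 50, 40, 8                       # frames, feature dim, coding dim
X = rng.normal(size=(N, f))
W, b = rng.normal(size=(f, d)) * 0.1, np.zeros(d)
H = encode_frames(X, W, b)
print(H.shape)                            # (50, 8): one coding vector per frame
```

The convolutional variant would replace the per-frame affine map with 1-D convolutions over the time axis, but yields a vector sequence of the same shape for the pooling layer to consume.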
In step 43, K pooling processes are respectively applied to the vector sequence to obtain corresponding K sub-embedded vectors; wherein, the ith pooling process in the K pooling processes includes, for any first encoded vector in the N encoded vectors, determining an attention coefficient of the first encoded vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the encoded vectors by using the attention coefficients of the encoded vectors as weighting factors; wherein K is an integer greater than 1.
In a specific embodiment, the ith pooling process specifically includes: obtaining a self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling process and the first coding vector; and determining, according to the self-attention score, an attention coefficient of the first coding vector such that the attention coefficient is positively correlated with the self-attention score.
In another embodiment, the ith pooling process specifically includes: sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector; obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling process and the first transformed vector; and determining, according to the self-attention score, an attention coefficient of the first coding vector such that the attention coefficient is positively correlated with the self-attention score.
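The second scoring variant (linear transformation, nonlinearity, then dot product with the head's attention vector) can be sketched as follows; `W`, `b`, and `a_i` are again stand-ins for trained parameters:

```python
import numpy as np

def transformed_scores(H, W, b, a_i):
    """score_j = a_i . tanh(W h_j + b) for each coding vector h_j in H."""
    V = np.tanh(H @ W.T + b)   # first transformed vectors, one per frame
    return V @ a_i             # self-attention scores, shape (N,)

rng = np.random.default_rng(2)
N, d, h = 50, 8, 16            # frames, coding dim, hidden dim
H = rng.normal(size=(N, d))
W, b = rng.normal(size=(h, d)), np.zeros(h)
a_i = rng.normal(size=h)
scores = transformed_scores(H, W, b, a_i)
print(scores.shape)            # (50,): one score per coding vector
```

The tanh bounds each transformed vector's components, so the scores cannot be dominated by a single large component of the raw coding vector.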
In one embodiment, the above method is performed using a neural network model, the ith attention vector being determined by training the neural network model.
In a further specific embodiment, the ith pooling process specifically comprises: determining a self-attention score of the first coded vector according to a self-attention scoring function; dividing the self-attention score by an ith precision coefficient preset for the ith pooling process to obtain an adjusted score; and determining, according to the adjusted score, an attention coefficient of the first coded vector such that the attention coefficient is positively correlated with the adjusted score.
After the K sub-embedding vectors are obtained through the K pooling processes, a total embedding vector is determined based on the K sub-embedding vectors in step 44. In a specific embodiment, the total embedding vector is obtained by concatenating the K sub-embedding vectors.
Then, in step 45, speaker recognition is performed based on the total embedding vector.
With this method, pooling adopts a multi-head attention mechanism to obtain a more effective embedding-vector representation, thereby improving speaker recognition accuracy.
According to an embodiment of yet another aspect, an apparatus for speaker recognition is provided that may be implemented as any device, platform, or cluster of devices having data storage, computing, processing capabilities. FIG. 5 shows a schematic block diagram of an apparatus for speaker recognition, according to one embodiment. As shown in fig. 5, the speaker recognition apparatus 500 includes:
an input module 51 configured to obtain spectral characteristics of an audio segment of a speaker;
the encoding module 52 is configured to encode the spectrum features to obtain a vector sequence formed by N encoding vectors at a frame level;
the pooling module 53 comprises K pooling sub-modules configured to apply K pooling processes to the vector sequence, respectively, to obtain corresponding K sub-embedded vectors; any ith pooling submodule of the K pooling submodules is configured to, for any first coding vector in the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the coding vectors by using the attention coefficients of the coding vectors as weight factors; wherein K is an integer greater than 1;
a fusion module 54 configured to determine a total embedded vector based on the K sub-embedded vectors;
a classification module 55 configured to perform speaker recognition based on the total embedded vector.
Through the device, efficient and accurate speaker recognition is achieved.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 4.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 4.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.
Claims (18)
1. A method of speaker recognition, comprising:
acquiring the frequency spectrum characteristics of the audio segments of the speaker;
coding the frequency spectrum characteristics to obtain N coding vectors at a frame level, wherein the N coding vectors form a vector sequence;
respectively applying K pooling processes to the vector sequence to obtain corresponding K sub-embedded vectors; wherein, the ith pooling process in the K pooling processes includes, for any first encoded vector in the N encoded vectors, determining an attention coefficient of the first encoded vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the encoded vectors by using the attention coefficients of the encoded vectors as weighting factors; wherein K is an integer greater than 1;
determining a total embedding vector based on the K sub-embedding vectors;
and carrying out speaker identification based on the total embedded vector.
2. The method of claim 1, wherein the spectral features comprise mel-frequency cepstral coefficient (MFCC) features, or mel-scale filter bank (FBank) features.
3. The method of claim 1, wherein encoding the spectral feature to obtain a vector sequence of N encoded vectors comprises:
and carrying out convolution processing on the frequency spectrum characteristics by utilizing a plurality of convolution kernels to obtain the vector sequence.
4. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling process and the first coding vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
5. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformation vector;
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling process and the first transformation vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
6. The method of claim 4 or 5, wherein the method is performed by a neural network model, the i-th attention vector being determined by training the neural network model.
7. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
determining a self-attention score for the first encoding vector according to a self-attention scoring function;
dividing the self-attention score by an ith precision coefficient preset for the ith pooling processing to obtain an adjustment score;
according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
8. The method of claim 1, wherein determining a total embedding vector based on the K sub-embedding vectors comprises:
and splicing the K sub-embedded vectors to obtain the total embedded vector.
9. A neural network model for speaker recognition, comprising:
the input layer is used for acquiring the frequency spectrum characteristics of the audio segments of the speaker;
the coding layer is used for coding the frequency spectrum characteristics to obtain N coding vectors at a frame level, and the N coding vectors form a vector sequence;
the pooling layer comprises K pooling units and is used for respectively applying K pooling processes to the vector sequence to obtain corresponding K sub-embedded vectors; any ith pooling unit of the K pooling units is configured to, for any first coding vector in the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling unit, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion layer for determining a total embedding vector based on the K sub-embedding vectors;
and the classification layer is used for carrying out speaker identification based on the total embedded vector.
10. The neural network model of claim 9, wherein the spectral features comprise mel-frequency cepstral coefficient (MFCC) features, or mel-scale filter bank (FBank) features.
11. The neural network model of claim 9, wherein the coding layers comprise a plurality of convolutional layers that convolve the spectral features with a plurality of convolution kernels, resulting in the sequence of vectors.
12. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling unit and the first coding vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
13. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformation vector;
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling unit and the first transformation vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
14. The neural network model of claim 12 or 13, wherein the ith attention vector is determined by training the neural network model.
15. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
determining a self-attention score for the first encoding vector according to a self-attention scoring function;
dividing the self-attention score by an ith precision coefficient preset for the ith pooling unit to obtain an adjustment score;
according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
16. The neural network model of claim 9, wherein the fusion layer is specifically configured to:
and splicing the K sub-embedded vectors to obtain the total embedded vector.
17. An apparatus for speaker recognition, comprising:
the input module is configured to acquire the spectral features of the speaker audio segment;
the coding module is configured to code the spectrum features to obtain N coding vectors at a frame level, and the N coding vectors form a vector sequence;
the pooling module comprises K pooling sub-modules and is configured to apply K pooling processes to the vector sequence respectively to obtain corresponding K sub-embedded vectors; any ith pooling submodule of the K pooling submodules is configured to, for any first coding vector in the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the coding vectors by using the attention coefficients of the coding vectors as weight factors; wherein K is an integer greater than 1;
a fusion module configured to determine a total embedded vector based on the K sub-embedded vectors;
and the classification module is configured to perform speaker recognition based on the total embedded vector.
18. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010256078.5A CN111145760B (en) | 2020-04-02 | 2020-04-02 | Method and neural network model for speaker recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111145760A true CN111145760A (en) | 2020-05-12 |
CN111145760B CN111145760B (en) | 2020-06-30 |
Family
ID=70528742
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111833886A (en) * | 2020-07-27 | 2020-10-27 | 中国科学院声学研究所 | Fully-connected multi-scale residual error network and voiceprint recognition method thereof |
CN112420057A (en) * | 2020-10-26 | 2021-02-26 | 四川长虹电器股份有限公司 | Voiceprint recognition method, device and equipment based on distance coding and storage medium |
CN112634880A (en) * | 2020-12-22 | 2021-04-09 | 北京百度网讯科技有限公司 | Speaker identification method, device, equipment, storage medium and program product |
CN113299295A (en) * | 2021-05-11 | 2021-08-24 | 支付宝(杭州)信息技术有限公司 | Training method and device for voiceprint coding network |
CN113658355A (en) * | 2021-08-09 | 2021-11-16 | 燕山大学 | Deep learning-based authentication identification method and intelligent air lock |
CN116072125A (en) * | 2023-04-07 | 2023-05-05 | 成都信息工程大学 | Method and system for constructing self-supervision speaker recognition model in noise environment |
US11676609B2 (en) | 2020-07-06 | 2023-06-13 | Beijing Century Tal Education Technology Co. Ltd. | Speaker recognition method, electronic device, and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10089556B1 (en) * | 2017-06-12 | 2018-10-02 | Konica Minolta Laboratory U.S.A., Inc. | Self-attention deep neural network for action recognition in surveillance videos |
CN109241536A (en) * | 2018-09-21 | 2019-01-18 | 浙江大学 | It is a kind of based on deep learning from the sentence sort method of attention mechanism |
US20190139541A1 (en) * | 2017-11-08 | 2019-05-09 | International Business Machines Corporation | Sensor Fusion Model to Enhance Machine Conversational Awareness |
CN109801635A (en) * | 2019-01-31 | 2019-05-24 | 北京声智科技有限公司 | A kind of vocal print feature extracting method and device based on attention mechanism |
CN110211574A (en) * | 2019-06-03 | 2019-09-06 | 哈尔滨工业大学 | Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism |
CN110334339A (en) * | 2019-04-30 | 2019-10-15 | 华中科技大学 | It is a kind of based on location aware from the sequence labelling model and mask method of attention mechanism |
US20200043508A1 (en) * | 2018-08-02 | 2020-02-06 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems and methods for a triplet network with attention for speaker diarization |
Non-Patent Citations (3)
Title |
---|
李岚欣 (Li Lanxin): "Research on Attention Mechanisms for Natural Language Processing", China Masters' Theses Full-text Database, Information Science and Technology Series *
蔡国都 (Cai Guodu): "Research on Speaker Recognition Based on x-vector", China Masters' Theses Full-text Database, Information Science and Technology Series *
郭佳 (Guo Jia) et al.: "Multi-step Network Traffic Prediction Based on a Full Attention Mechanism", Journal of Signal Processing *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||