CN111145760A - Method and neural network model for speaker recognition

Method and neural network model for speaker recognition

Info

Publication number
CN111145760A
Authority
CN
China
Prior art keywords
vector
attention
ith
vectors
coding
Prior art date
Legal status
Granted
Application number
CN202010256078.5A
Other languages
Chinese (zh)
Other versions
CN111145760B (en)
Inventor
王志铭
姚开盛
李小龙
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010256078.5A
Publication of CN111145760A
Application granted
Publication of CN111145760B
Legal status: Active
Anticipated expiration

Classifications

    • G10L 17/02 — Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 — Speaker identification or verification: training, enrolment or model building
    • G10L 17/18 — Speaker identification or verification: artificial neural networks; connectionist approaches
    • G10L 25/24 — Speech or voice analysis techniques: extracted parameters being the cepstrum
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06N 3/048 — Neural network architectures: activation functions
    • G06N 3/08 — Neural networks: learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of the present specification provide a method and a neural network model for speaker recognition. According to the method, the spectral features of a speaker audio segment are first obtained; the spectral features are then encoded into a vector sequence formed by N frame-level coding vectors. Next, K pooling processes are applied to the vector sequence to obtain K corresponding sub-embedded vectors; the i-th pooling process includes, for each of the N coding vectors, determining its attention coefficient based on an i-th attention algorithm corresponding to that pooling process, and summing the coding vectors using their attention coefficients as weighting factors. A total embedding vector is then determined based on the K sub-embedded vectors, and speaker recognition is performed based on the total embedding vector.

Description

Method and neural network model for speaker recognition
Technical Field
One or more embodiments of the present specification relate to the field of machine learning, and more particularly, to methods and neural network models for speaker recognition.
Background
Voiceprints are acoustic features extracted from the spectral characteristics of a speaker's sound waves. Like a fingerprint, a voiceprint, as a biometric, reflects the individual characteristics and identity of a speaker. Voiceprint recognition, also known as speaker recognition, is a biometric authentication technique that uses the speaker-specific information contained in a speech signal to automatically identify the speaker. This biometric authentication technique has broad application prospects in fields and scenarios such as identity authentication and security verification.
Speaker recognition systems and neural network models have been proposed for authentication and verification, which generally extract feature vectors from speaker audio that express the characteristics of the speaker's voice, and perform speaker recognition based on the feature vectors. However, the recognition accuracy of the existing schemes still needs to be improved.
It is desirable to have an improved scheme that can more effectively acquire feature vectors reflecting the speaking characteristics of a speaker, thereby further improving the accuracy of speaker recognition.
Disclosure of Invention
One or more embodiments of the present specification describe a method and a neural network model for speaker recognition, in which a global multi-head attention mechanism is used for vector aggregation, so as to better capture the speech-segment-level characteristics of the speaker and improve the accuracy of speaker recognition.
According to a first aspect, there is provided a method of speaker recognition, comprising:
acquiring the spectral features of a speaker audio segment;
encoding the spectral features to obtain a vector sequence formed by N frame-level coding vectors;
respectively applying K pooling processes to the vector sequence to obtain K corresponding sub-embedded vectors; wherein an ith pooling process among the K pooling processes includes, for any first coding vector among the N coding vectors, determining an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the coding vectors using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
determining a total embedding vector based on the K sub-embedding vectors;
and carrying out speaker identification based on the total embedded vector.
In different embodiments, the spectral features may comprise Mel-frequency cepstral coefficient (MFCC) features or Mel-scale filter bank (FBank) features.
In one embodiment, a fully connected feedforward neural network is used to encode the spectral features to obtain the N code vectors. In another embodiment, the vector sequence is obtained by performing convolution processing on the spectral features by using a plurality of convolution kernels.
In a specific embodiment, the ith pooling process specifically includes obtaining a self-attention score of the first coding vector from the dot product of the ith attention vector corresponding to the ith pooling process and the first coding vector; and determining the attention coefficient of the first coding vector according to the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In another embodiment, the ith pooling process specifically includes sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector; obtaining the self-attention score of the first coding vector from the dot product of the ith attention vector and the first transformed vector; and determining the attention coefficient of the first coding vector according to the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In one embodiment, the above method is performed using a neural network model, the ith attention vector being determined by training the neural network model.
In a further particular embodiment, the ith pooling process particularly comprises determining a self-attention score for the first coded vector according to a self-attention scoring function; dividing the self-attention score by an ith precision coefficient preset for the ith pooling processing to obtain an adjustment score; according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
In a specific embodiment, the total embedded vector is obtained by concatenating the K sub-embedded vectors.
According to a second aspect, there is provided a neural network model for speaker recognition, comprising:
the input layer is used for acquiring the frequency spectrum characteristics of the audio segments of the speaker;
the coding layer is used for coding the frequency spectrum characteristics to obtain a vector sequence consisting of N coding vectors at a frame level;
the pooling layer comprises K pooling units and is used for respectively applying K pooling processes to the vector sequence to obtain K corresponding sub-embedded vectors; any ith pooling unit among the K pooling units is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling unit, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion layer for determining a total embedding vector based on the K sub-embedding vectors;
and the classification layer is used for carrying out speaker identification based on the total embedded vector.
According to a third aspect, there is provided an apparatus for speaker recognition, comprising:
the input module is configured to acquire the frequency spectrum characteristics of the speaker audio fragment;
the coding module is configured to code the spectrum characteristics to obtain a vector sequence formed by N coding vectors at a frame level;
the pooling module comprises K pooling submodules and is configured to apply K pooling processes to the vector sequence respectively to obtain K corresponding sub-embedded vectors; any ith pooling submodule among the K pooling submodules is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion module configured to determine a total embedded vector based on the K sub-embedded vectors;
and the classification module is configured to perform speaker recognition based on the total embedded vector.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first or second aspect.
According to the method and the neural network model provided by the embodiments of this specification, after the frame-level coding vectors are obtained through encoding, they are aggregated into an embedded vector using an attention mechanism in a global multi-head attention pooling manner. In a further embodiment, the precision, or resolution, of each attention head can also be set, forming a multi-resolution multi-attention-head pooling scheme. The total embedded vector obtained in this way reflects the speech-segment-level characteristics of the speaker more comprehensively, and speaker recognition based on it can therefore achieve higher recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a schematic structural diagram of a neural network model for speaker recognition, according to one embodiment of the present description;
FIG. 3 shows graphs of the attention coefficient under different precision coefficients;
FIG. 4 illustrates a flow diagram of a method of speaker recognition, according to one embodiment;
FIG. 5 shows a schematic block diagram of an apparatus for speaker recognition, according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. First, a speaker speaks, forming speaker audio. The speaker audio is input into a spectrum extraction unit, which extracts the basic spectral features from it. These spectral features are input into a neural network model. In the embodiment of Fig. 1, the neural network model first processes the input spectral features to obtain a sequence of frame-level coding vectors, i.e., N coding vectors corresponding to the frames of the audio. The sequence of frame-level coding vectors is then aggregated into an embedded vector that expresses the utterance characteristics at the speech segment (utterance) level. The neural network model then predicts a speaker identification (id) based on the embedded vector, thereby performing speaker recognition.
As mentioned above, the feature vector on which speaker recognition is based, i.e. the above embedded vector in the embodiments of the present specification, is crucial to the accuracy of voiceprint recognition. It is expected that the embedded vector can fully reflect the speaking voice characteristics of the speaker, so that higher recognition accuracy can be obtained. In the embodiment of the present specification, an attention mechanism is utilized, and a global multi-head attention pooling manner is adopted to aggregate N frame-level coding vectors into one embedded vector, so that the embedded vector comprehensively reflects the characteristics of the speaker speech segment level. The process of encoding, pooling, and identifying the audio spectrum is described below in conjunction with the structure of the neural network model.
FIG. 2 illustrates a schematic structural diagram of a neural network model for speaker recognition, according to one embodiment of the present specification. It will be appreciated that the neural network model may be implemented in any suitable programming language. As shown in FIG. 2, the neural network model can be implemented as a deep neural network as a whole, which includes at least an input layer 21, a coding layer 22, a pooling layer 23, a fusion layer 24, and a classification layer 25. The input layer 21 is used to obtain the spectral features of the speaker audio segment and pass them to the coding layer 22; the coding layer 22 encodes the spectral features to obtain N frame-level coding vectors h_1, h_2, …, h_N; the pooling layer 23 comprises K pooling units, each of which applies an attention-based pooling process to the N coding vectors, resulting in a sub-embedded vector, so that the pooling layer 23 outputs K sub-embedded vectors e_1, e_2, …, e_K; the fusion layer 24 fuses the K sub-embedded vectors to obtain a total embedding vector E; finally, the classification layer 25 performs classification prediction based on the embedding vector E and outputs the speaker recognition result.
The processing of the above layers is described in detail below.
First, the input layer 21 obtains the spectral features of the speaker audio segment. In one embodiment, the speaker audio segment is input into a spectrum extraction unit external to the model, which extracts the basic spectral features from the speaker audio and then feeds them into the input layer 21 of the neural network model. In another embodiment, the input layer 21 itself is designed with a spectrum extraction function; in that case, the speaker audio may be input directly into the input layer, and the spectral features are obtained by feature extraction in the input layer.
In one embodiment, the spectral feature is the Mel-frequency cepstral coefficient (MFCC) feature. The Mel frequency scale is derived from the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz. Extracting MFCC features from speaker audio typically includes the following steps: pre-emphasis, framing, windowing, Fourier transform, Mel filter bank, and discrete cosine transform (DCT). Pre-emphasis boosts the high-frequency part to some extent so that the spectrum of the signal becomes flatter; framing divides the speech into a series of frames in time; windowing applies a window function to improve the continuity at the left and right ends of each frame. The audio is then Fourier-transformed, converting the time-domain signal into a frequency-domain signal. Next, the frequencies of the frequency-domain signal are mapped onto the Mel scale using the Mel filter bank, yielding the Mel spectrum. Finally, the cepstral coefficients of the Mel spectrum are obtained through the discrete cosine transform, giving the MFCC features.
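As a concrete illustration of this pipeline, the following sketch extracts MFCC and FBank features with the librosa library. It is only a minimal example under assumed settings: the 16 kHz sampling rate, the 25 ms/10 ms framing, and the 40 Mel bands and 40 cepstral coefficients are illustrative choices, not values prescribed by the patent.

```python
# Minimal sketch of MFCC / FBank extraction; sampling rate, frame sizes and
# coefficient counts are illustrative assumptions, not values from the patent.
import librosa

def extract_spectral_features(wav_path, sr=16000, n_mels=40, n_mfcc=40):
    y, _ = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y)                     # boost the high-frequency part
    # FBank: power spectrum mapped onto the Mel scale (no DCT yet).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)                       # (n_mels, N frames), log-Mel
    # MFCC: discrete cosine transform applied on top of the log-Mel (FBank) spectrum.
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)    # (n_mfcc, N frames)
    return mfcc.T, fbank.T                                 # each (N frames, feature_dim)
```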
In another embodiment, the spectral feature uses the Mel-scale filter bank (FBank) feature. The FBank feature is the spectral feature obtained by mapping the frequencies of the frequency-domain signal onto the Mel scale using the Mel filter bank. In other words, the MFCC feature is obtained by applying a further discrete cosine transform on top of the FBank feature; the FBank feature is what the MFCC feature is before the discrete cosine transform.
In yet another embodiment, the spectral feature may be a linear predictive coding (LPC) feature or a perceptual linear prediction (PLP) feature. These features can be extracted by conventional methods. Other spectral features may also be extracted as the basis for the neural network model's processing, which is not specifically limited here.
Based on the spectral features obtained above, the coding layer 22 encodes them to obtain a vector sequence H formed by frame-level coding vectors, which can be written as H = [h_1, h_2, …, h_N], where h_t corresponds to the spectral feature of the t-th frame. Since the coding layer 22 is an intermediate layer of the neural network model, the coding vectors h_t are also referred to as hidden vectors.
The encoding layer 22 may derive the frame-level encoding vectors in a variety of ways.
In one embodiment, the encoding layer 22 is implemented as a multi-layer perceptron, and the spectral features of each frame are processed layer by layer in a fully connected feedforward manner to obtain the coding vector h_t corresponding to each frame.
In another embodiment, the encoding layer 22 processes the spectral features by means of convolution operations. Specifically, in one example, the coding layer 22 may be implemented as a multi-layer convolutional residual network comprising a plurality of convolutional layers, each with a corresponding convolution kernel used to convolve the spectral features. The convolution kernels of the different layers may have the same or different sizes; for example, in one example, the first two convolutional layers each use a 3 x 1 convolution kernel, and the following layers each use a 3 x 3 convolution kernel. The convolution kernels of different layers have different convolution parameters. Through such multi-layer convolution operations, the coding vector h_t corresponding to the spectral feature of each frame is obtained.
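As an illustration of the convolutional variant of the coding layer described above, the following PyTorch sketch maps a spectral feature sequence to N frame-level coding vectors. The framework, layer widths and kernel layout are assumptions made for illustration; in particular, this sketch omits the residual connections mentioned above and is not the patent's exact network.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Sketch of a coding layer: spectral features -> frame-level coding vectors h_1..h_N."""
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        # Treat the (frames x features) spectrogram as a one-channel image;
        # two 3x1 kernels along the time axis first, then a 3x3 kernel.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=(3, 3), padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * feat_dim, hidden_dim)

    def forward(self, spec):                        # spec: (batch, N, feat_dim)
        x = self.conv(spec.unsqueeze(1))            # (batch, 64, N, feat_dim)
        x = x.permute(0, 2, 1, 3).flatten(2)        # (batch, N, 64 * feat_dim)
        return self.proj(x)                         # (batch, N, hidden_dim): coding vectors h_t
```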
After the N coding vectors are obtained in any of the above ways, the coding layer 22 outputs the vector sequence H formed by the N coding vectors to the pooling layer 23, which pools and aggregates the N coding vectors.
As shown in fig. 2, the pooling layer 23 includes K pooling units, each of which determines an attention coefficient for each coding vector based on an attention mechanism and aggregates the vector sequence into a corresponding sub-embedded vector based on these attention coefficients. A pooling unit may therefore also be referred to as an attention head. In this manner, the pooling layer 23 obtains multiple sub-embedded vectors using multiple attention heads. The aggregation process of any one pooling unit, the i-th pooling unit, is described below.
For any coding vector h_t among the N coding vectors, the i-th pooling unit determines the attention coefficient α_t of that coding vector according to the i-th attention algorithm corresponding to this pooling unit, and then sums the N coding vectors using their respective attention coefficients as weighting factors to obtain the corresponding sub-embedded vector. The i-th attention algorithm corresponding to the i-th pooling unit may include: scoring the coding vector h_t with a self-attention scoring function based on the corresponding i-th attention vector to obtain the self-attention score of h_t; and determining the attention coefficient of h_t based on the self-attention score, such that the attention coefficient is positively correlated with the self-attention score. The i-th attention vector v^{(i)} corresponding to the i-th pooling unit may be preset as a hyper-parameter, but more generally and preferably, the i-th attention vector v^{(i)} is determined by training the neural network model.
In one example, the self-attention scoring function described above may be expressed as:

$$ s_t^{(i)} = \left(v^{(i)}\right)^{\top} h_t \tag{1} $$

where the superscript (i) indicates correspondence to the i-th pooling unit, and (v^{(i)})^T denotes the transpose of v^{(i)}. Equation (1) therefore amounts to taking the dot product of the i-th attention vector v^{(i)} and the coding vector h_t to obtain the self-attention score of h_t.
In other examples, more complex transformations may be applied on this basis to obtain the self-attention score. In one example, a linear transformation and a nonlinear activation function may first be applied to the coding vector h_t in sequence to obtain a transformed vector; the self-attention score is then obtained from the dot product of the i-th attention vector v^{(i)} and the transformed vector. Specifically, in one example, the self-attention scoring function may be expressed as:

$$ s_t^{(i)} = \left(v^{(i)}\right)^{\top} f\!\left(W^{(i)} h_t + b^{(i)}\right) + k^{(i)} \tag{2} $$

where W^{(i)} is the matrix used for the linear transformation, f is a nonlinear activation function such as the sigmoid function, b^{(i)} is an optional bias vector, and k^{(i)} is an optional bias parameter. These parameters are determined through the training process of the model.
Other self-attention scoring functions, not enumerated herein, may also be derived with minor modifications based on the above equations, such as adding coefficients, adding or subtracting bias terms, and the like.
After the self-attention score of the coding vector h_t has been determined, the attention coefficient of h_t can be determined from that score. The attention coefficient is used as the subsequent weighting factor, measuring the importance of the corresponding coding vector in this aggregation.
According to embodiments of the present specification, the attention coefficient α_t of the coding vector h_t is determined so as to be positively correlated with the self-attention score s_t^{(i)} of h_t. Specifically, in one example, the self-attention score computed above may be used directly as the attention coefficient. In another example, the self-attention scores of the N coding vectors are normalized, and the normalized ratio is taken as the attention coefficient.
For example, in one example, using scale normalization, the attention coefficient α_t of the coding vector h_t can be expressed as:

$$ \alpha_t^{(i)} = \frac{s_t^{(i)}}{\sum_{j=1}^{N} s_j^{(i)}} \tag{3} $$

In another example, normalization is performed using the softmax function, and the attention coefficient α_t of the coding vector h_t can be expressed as:

$$ \alpha_t^{(i)} = \frac{\exp\!\left(s_t^{(i)}\right)}{\sum_{j=1}^{N} \exp\!\left(s_j^{(i)}\right)} \tag{4} $$

With the attention coefficient of each coding vector determined, the coding vectors are summed using their attention coefficients as weighting factors, yielding the sub-embedded vector e_i corresponding to the i-th pooling unit:

$$ e_i = \sum_{t=1}^{N} \alpha_t^{(i)} h_t \tag{5} $$

Assuming each coding vector is a d-dimensional vector, the resulting sub-embedded vector is also a d-dimensional vector.
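A single pooling unit of this kind can be sketched in a few lines of PyTorch. The sketch below combines the scoring of equation (2) (with tanh assumed as the activation f and the optional scalar bias omitted), the softmax normalization of equation (4), and the weighted sum of equation (5); the tensor shapes and hidden dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One pooling unit: scores each frame-level coding vector, then aggregates them."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim)       # linear transformation W (with bias b)
        self.v = nn.Parameter(torch.randn(hidden_dim))   # attention vector v^(i), learned in training

    def forward(self, H):                                # H: (batch, N, hidden_dim) coding vectors
        scores = torch.tanh(self.W(H)) @ self.v          # eq. (2): s_t = v^T f(W h_t + b) -> (batch, N)
        alpha = torch.softmax(scores, dim=-1)            # eq. (4): attention coefficients
        e = torch.einsum('bn,bnd->bd', alpha, H)         # eq. (5): weighted sum -> sub-embedded vector
        return e, alpha
```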
It can be understood that each pooling unit (i.e. attention head) assigns different attention coefficients to each coded vector in the above manner (only the attention vectors in each pooling unit are different), and then aggregates each coded vector according to the attention coefficients to obtain corresponding sub-embedded vectors, so that K pooling units obtain K sub-embedded vectors respectively. Therefore, the pooling layer 23 uses a multi-head attention method to obtain a plurality of sub-embedding vectors through a plurality of attention-based aggregation methods.
In contrast to some techniques that take a segment of coding vectors as the unit of analysis, the multiple attention heads in the embodiments of this specification determine an attention coefficient for each frame-level coding vector and then perform weighted aggregation over all coding vectors, thereby taking the speech information of all frames into account; this may be referred to as a global multi-head attention approach. In this way, speaker characteristic information at the speech segment level can be captured better.
Further, in one embodiment, on the basis of multi-head attention, a different precision coefficient T_i is set for each attention head to adjust the precision, or "resolution", of its attention coefficients. Borrowing terminology from knowledge distillation and annealing, this coefficient may be referred to as a temperature coefficient. The temperature coefficient T_i scales the self-attention score in inverse proportion and is introduced into the determination of the attention coefficient, thereby adjusting the resolution of the attention coefficient.
Specifically, in one embodiment, the self-attention score is divided by the i-th precision coefficient T_i preset for the i-th pooling unit to obtain an adjusted score; the attention coefficient of the coding vector is then determined from the adjusted score.
When the attention coefficient is determined using equation (4), the attention coefficient corrected by the temperature coefficient can be expressed as:

$$ \alpha_t^{(i)} = \frac{\exp\!\left(s_t^{(i)} / T_i\right)}{\sum_{j=1}^{N} \exp\!\left(s_j^{(i)} / T_i\right)} \tag{6} $$
Fig. 3 shows how the attention coefficient varies under different precision coefficients. In the leftmost plot of fig. 3, the temperature coefficient is T = 1; the attention coefficient curve transitions very steeply, and the attention coefficient is very sensitive to changes in the self-attention score. When the temperature coefficient rises to 20, the curve becomes gentler, as shown in the middle plot. When the temperature coefficient rises to 30, as shown in the rightmost plot, the curve changes even more gradually. The flatter the curve, the less sensitive the attention coefficient is to changes in the self-attention score, i.e., the lower the precision or resolution of the attention coefficient. It can be seen that as the temperature coefficient increases, the resolution of the attention coefficient decreases; when the temperature coefficient approaches infinity, the attention coefficient becomes a constant (1/N) that no longer changes with the self-attention score, and the resolution of the attention coefficient is 0.
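This flattening effect can also be checked numerically with a toy example; the scores below are made up purely for illustration:

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.5, 0.1])   # made-up self-attention scores for 4 frames
for T in (1, 20, 30):
    # As T grows, the coefficients flatten toward the constant 1/N.
    print(T, torch.softmax(scores / T, dim=0).tolist())
```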
In one embodiment, different temperature coefficients T may be preset for the respective pooling units, thereby forming a multi-resolution, multi-attention-head pooling layer 23. Each pooling unit in the pooling layer 23 performs attention aggregation on the N coding vectors using its corresponding temperature coefficient and attention algorithm to obtain the corresponding sub-embedded vector. The pooling layer 23 thus obtains K sub-embedded vectors through its K pooling units.
Next, the K sub-embedded vectors are input into the fusion layer 24, which fuses the K sub-embedded vectors into a total embedding vector E.
Specifically, in one embodiment, the fusion layer 24 concatenates the K sub-embedded vectors to obtain the total embedding vector E. Assuming the coding vectors and sub-embedded vectors have dimension d, the total embedding vector thus obtained has dimension K x d. In other examples, the fusion layer 24 may also use other fusion operations to obtain the total embedding vector E; for example, the K sub-embedded vectors may be summed or multiplied element-wise.
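Putting the pieces together, a pooling layer with K temperature-scaled attention heads followed by concatenation fusion might look like the sketch below. The head count, the per-head temperatures and the use of the simple dot-product scoring of equation (1) are illustrative assumptions; equation (6) governs the temperature scaling.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionPooling(nn.Module):
    """K pooling units with per-head temperature T_i, fused by concatenation."""
    def __init__(self, hidden_dim=256, temperatures=(1.0, 5.0, 10.0, 20.0)):
        super().__init__()
        self.temperatures = temperatures
        # One attention vector per head, each learned during training.
        self.v = nn.Parameter(torch.randn(len(temperatures), hidden_dim))

    def forward(self, H):                                    # H: (batch, N, hidden_dim)
        sub_embeddings = []
        for i, T in enumerate(self.temperatures):
            scores = H @ self.v[i]                           # eq. (1): s_t = v_i^T h_t -> (batch, N)
            alpha = torch.softmax(scores / T, dim=-1)        # eq. (6): temperature-scaled coefficients
            sub_embeddings.append(torch.einsum('bn,bnd->bd', alpha, H))   # eq. (5)
        return torch.cat(sub_embeddings, dim=-1)             # total embedding E, dimension K * d
```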
The classification layer 25 may then perform speaker recognition based on the total embedded vector obtained as described above.
In one specific example, the classification layer 25 may include several fully connected sublayers that further process the total embedded vector, and an output layer that classifies the resulting vector, for example using a softmax function. The classification result may specifically be a speaker id, or a binary result indicating whether the audio belongs to a particular speaker. In this manner, the classification layer 25 outputs the result of speaker recognition.
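A classification layer along these lines might be sketched as follows; the number and width of the fully connected sublayers and the number of speakers are placeholders for illustration, not values from the patent.

```python
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """Fully connected sublayers over the total embedding E, followed by a softmax output."""
    def __init__(self, embed_dim, num_speakers, fc_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, num_speakers),                # logits over speaker ids
        )

    def forward(self, E):                                   # E: (batch, K * d) total embedding
        return self.net(E).log_softmax(dim=-1)              # log-probabilities for speaker recognition
```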
Reviewing the structure and processing flow of the neural network model above, it can be seen that after the coding layer obtains the frame-level coding vectors, the pooling layer aggregates them into an embedded vector using an attention mechanism in a global multi-head attention pooling manner. In a further embodiment, the precision or resolution of each attention head may also be set, forming a multi-resolution multi-attention-head pooling layer. The total embedded vector obtained in this way reflects the speech-segment-level characteristics of the speaker more comprehensively, and speaker recognition based on it can therefore achieve higher recognition accuracy.
In another aspect, a method of speaker recognition is provided. FIG. 4 illustrates a flow diagram of a method of speaker recognition according to one embodiment. The method may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities. As shown in FIG. 4, the method includes the following steps.
In step 41, the spectral features of the speaker audio segment are obtained. In different embodiments, the spectral features may comprise Mel-frequency cepstral coefficient (MFCC) features or Mel-scale filter bank (FBank) features.
Then, in step 42, the spectral features are encoded to obtain a vector sequence of N encoded vectors at the frame level. In one embodiment, a fully connected feedforward neural network is used to encode the spectral features to obtain the N code vectors. In another embodiment, the vector sequence is obtained by performing convolution processing on the spectral features by using a plurality of convolution kernels.
In step 43, K pooling processes are applied to the vector sequence respectively to obtain K corresponding sub-embedded vectors; an ith pooling process among the K pooling processes includes, for any first coding vector among the N coding vectors, determining an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the coding vectors using the attention coefficients of the coding vectors as weighting factors; K is an integer greater than 1.
In a specific embodiment, the ith pooling process specifically includes obtaining the self-attention score of the first coding vector from the dot product of the ith attention vector corresponding to the ith pooling process and the first coding vector; and determining the attention coefficient of the first coding vector according to the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In another embodiment, the ith pooling process specifically includes sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector; obtaining the self-attention score of the first coding vector from the dot product of the ith attention vector corresponding to the ith pooling process and the first transformed vector; and determining the attention coefficient of the first coding vector according to the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In one embodiment, the above method is performed using a neural network model, the ith attention vector being determined by training the neural network model.
In a further particular embodiment, the ith pooling process particularly comprises determining a self-attention score of the first encoding vector according to a self-attention scoring function; dividing the self-attention score by an ith precision coefficient preset for the ith pooling processing to obtain an adjustment score; according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
After the K sub-embedded vectors are obtained through the K pooling processes, a total embedding vector is determined based on the K sub-embedded vectors in step 44. In a specific embodiment, the total embedding vector is obtained by concatenating the K sub-embedded vectors.
Then, in step 45, speaker recognition is performed based on the total embedding vector.
With this method, a multi-head attention mechanism is adopted for pooling, a more effective embedded vector representation is obtained, and the accuracy of speaker recognition is thereby improved.
According to an embodiment of yet another aspect, an apparatus for speaker recognition is provided that may be implemented as any device, platform, or cluster of devices having data storage, computing, processing capabilities. FIG. 5 shows a schematic block diagram of an apparatus for speaker recognition, according to one embodiment. As shown in fig. 5, the speaker recognition apparatus 500 includes:
an input module 51 configured to obtain spectral characteristics of an audio segment of a speaker;
the encoding module 52 is configured to encode the spectrum features to obtain a vector sequence formed by N encoding vectors at a frame level;
the pooling module 53 comprises K pooling submodules and is configured to apply K pooling processes to the vector sequence respectively to obtain K corresponding sub-embedded vectors; any ith pooling submodule among the K pooling submodules is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion module 54 configured to determine a total embedded vector based on the K sub-embedded vectors;
a classification module 55 configured to perform speaker recognition based on the total embedded vector.
Through the device, efficient and accurate speaker recognition is achieved.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 4.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 4.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (18)

1. A method of speaker recognition, comprising:
acquiring the frequency spectrum characteristics of the audio segments of the speaker;
coding the frequency spectrum characteristics to obtain N coding vectors at a frame level, wherein the N coding vectors form a vector sequence;
respectively applying K pooling processes to the vector sequence to obtain K corresponding sub-embedded vectors; wherein an ith pooling process among the K pooling processes includes, for any first coding vector among the N coding vectors, determining an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
determining a total embedding vector based on the K sub-embedding vectors;
and carrying out speaker identification based on the total embedded vector.
2. The method of claim 1, wherein the spectral features comprise Mel-frequency cepstral coefficient (MFCC) features, or Mel-scale filter bank (FBank) features.
3. The method of claim 1, wherein encoding the spectral feature to obtain a vector sequence of N encoded vectors comprises:
and carrying out convolution processing on the frequency spectrum characteristics by utilizing a plurality of convolution kernels to obtain the vector sequence.
4. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling process and the first coding vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
5. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector;
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling process and the first transformed vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
6. The method of claim 4 or 5, wherein the method is performed by a neural network model, the i-th attention vector being determined by training the neural network model.
7. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
determining a self-attention score for the first encoding vector according to a self-attention scoring function;
dividing the self-attention score by an ith precision coefficient preset for the ith pooling processing to obtain an adjustment score;
according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
8. The method of claim 1, wherein determining a total embedding vector based on the K sub-embedding vectors comprises:
and splicing the K sub-embedded vectors to obtain the total embedded vector.
9. A neural network model for speaker recognition, comprising:
the input layer is used for acquiring the frequency spectrum characteristics of the audio segments of the speaker;
the coding layer is used for coding the frequency spectrum characteristics to obtain N coding vectors at a frame level, and the N coding vectors form a vector sequence;
the pooling layer comprises K pooling units and is used for respectively applying K pooling processes to the vector sequence to obtain K corresponding sub-embedded vectors; any ith pooling unit among the K pooling units is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling unit, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion layer for determining a total embedding vector based on the K sub-embedding vectors;
and the classification layer is used for carrying out speaker identification based on the total embedded vector.
10. The neural network model of claim 9, wherein the spectral features comprise Mel-frequency cepstral coefficient (MFCC) features, or Mel-scale filter bank (FBank) features.
11. The neural network model of claim 9, wherein the coding layers comprise a plurality of convolutional layers that convolve the spectral features with a plurality of convolution kernels, resulting in the sequence of vectors.
12. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling unit and the first coding vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
13. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector;
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling unit and the first transformed vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
14. The neural network model of claim 12 or 13, wherein the ith attention vector is determined by training the neural network model.
15. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
determining a self-attention score for the first encoding vector according to a self-attention scoring function;
dividing the self-attention score by an ith precision coefficient preset for the ith pooling unit to obtain an adjustment score;
according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
16. The neural network model of claim 9, wherein the fusion layer is specifically configured to:
and splicing the K sub-embedded vectors to obtain the total embedded vector.
17. An apparatus for speaker recognition, comprising:
the input module is configured to acquire the frequency spectrum characteristics of the speaker audio fragment;
the coding module is configured to code the spectrum features to obtain N coding vectors at a frame level, and the N coding vectors form a vector sequence;
the pooling module comprises K pooling submodules and is configured to apply K pooling processes to the vector sequence respectively to obtain K corresponding sub-embedded vectors; any ith pooling submodule among the K pooling submodules is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the coding vectors by using the attention coefficients of the coding vectors as weight factors; wherein K is an integer greater than 1;
a fusion module configured to determine a total embedded vector based on the K sub-embedded vectors;
and the classification module is configured to perform speaker recognition based on the total embedded vector.
18. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-8.
CN202010256078.5A 2020-04-02 2020-04-02 Method and neural network model for speaker recognition Active CN111145760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010256078.5A CN111145760B (en) 2020-04-02 2020-04-02 Method and neural network model for speaker recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010256078.5A CN111145760B (en) 2020-04-02 2020-04-02 Method and neural network model for speaker recognition

Publications (2)

Publication Number Publication Date
CN111145760A 2020-05-12
CN111145760B 2020-06-30

Family

ID=70528742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010256078.5A Active CN111145760B (en) 2020-04-02 2020-04-02 Method and neural network model for speaker recognition

Country Status (1)

Country Link
CN (1) CN111145760B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof
CN112420057A (en) * 2020-10-26 2021-02-26 四川长虹电器股份有限公司 Voiceprint recognition method, device and equipment based on distance coding and storage medium
CN112634880A (en) * 2020-12-22 2021-04-09 北京百度网讯科技有限公司 Speaker identification method, device, equipment, storage medium and program product
CN113299295A (en) * 2021-05-11 2021-08-24 支付宝(杭州)信息技术有限公司 Training method and device for voiceprint coding network
CN113658355A (en) * 2021-08-09 2021-11-16 燕山大学 Deep learning-based authentication identification method and intelligent air lock
CN116072125A (en) * 2023-04-07 2023-05-05 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment
US11676609B2 (en) 2020-07-06 2023-06-13 Beijing Century Tal Education Technology Co. Ltd. Speaker recognition method, electronic device, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN109241536A (en) * 2018-09-21 2019-01-18 浙江大学 It is a kind of based on deep learning from the sentence sort method of attention mechanism
US20190139541A1 (en) * 2017-11-08 2019-05-09 International Business Machines Corporation Sensor Fusion Model to Enhance Machine Conversational Awareness
CN109801635A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of vocal print feature extracting method and device based on attention mechanism
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110334339A (en) * 2019-04-30 2019-10-15 华中科技大学 It is a kind of based on location aware from the sequence labelling model and mask method of attention mechanism
US20200043508A1 (en) * 2018-08-02 2020-02-06 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for a triplet network with attention for speaker diarization

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
US20190139541A1 (en) * 2017-11-08 2019-05-09 International Business Machines Corporation Sensor Fusion Model to Enhance Machine Conversational Awareness
US20200043508A1 (en) * 2018-08-02 2020-02-06 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for a triplet network with attention for speaker diarization
CN109241536A (en) * 2018-09-21 2019-01-18 浙江大学 It is a kind of based on deep learning from the sentence sort method of attention mechanism
CN109801635A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of vocal print feature extracting method and device based on attention mechanism
CN110334339A (en) * 2019-04-30 2019-10-15 华中科技大学 It is a kind of based on location aware from the sequence labelling model and mask method of attention mechanism
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LI LANXIN: "Research on Attention Mechanisms for Natural Language Processing", China Master's Theses Full-text Database, Information Science and Technology *
CAI GUODU: "Research on Speaker Recognition Based on x-vector", China Master's Theses Full-text Database, Information Science and Technology *
GUO JIA ET AL.: "Multi-step Network Traffic Prediction Based on a Full Attention Mechanism", Journal of Signal Processing *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11676609B2 (en) 2020-07-06 2023-06-13 Beijing Century Tal Education Technology Co. Ltd. Speaker recognition method, electronic device, and storage medium
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof
CN111833886B (en) * 2020-07-27 2021-03-23 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof
CN112420057A (en) * 2020-10-26 2021-02-26 四川长虹电器股份有限公司 Voiceprint recognition method, device and equipment based on distance coding and storage medium
CN112634880A (en) * 2020-12-22 2021-04-09 北京百度网讯科技有限公司 Speaker identification method, device, equipment, storage medium and program product
CN113299295A (en) * 2021-05-11 2021-08-24 支付宝(杭州)信息技术有限公司 Training method and device for voiceprint coding network
CN113299295B (en) * 2021-05-11 2022-12-30 支付宝(杭州)信息技术有限公司 Training method and device for voiceprint coding network
CN113658355A (en) * 2021-08-09 2021-11-16 燕山大学 Deep learning-based authentication identification method and intelligent air lock
CN116072125A (en) * 2023-04-07 2023-05-05 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment
CN116072125B (en) * 2023-04-07 2023-10-17 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment

Also Published As

Publication number Publication date
CN111145760B (en) 2020-06-30

Similar Documents

Publication Publication Date Title
CN111145760B (en) Method and neural network model for speaker recognition
CN108447490B (en) Voiceprint recognition method and device based on memorability bottleneck characteristics
Lokesh et al. An automatic tamil speech recognition system by using bidirectional recurrent neural network with self-organizing map
US11170788B2 (en) Speaker recognition
Sarangi et al. Optimization of data-driven filterbank for automatic speaker verification
CN108281146B (en) Short voice speaker identification method and device
CN111429948B (en) Voice emotion recognition model and method based on attention convolution neural network
WO2019237519A1 (en) General vector training method, voice clustering method, apparatus, device and medium
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
US20210217431A1 (en) Voice morphing apparatus having adjustable parameters
Panchapagesan et al. Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC
Ajmera et al. Fractional Fourier transform based features for speaker recognition using support vector machine
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
US20210193159A1 (en) Training a voice morphing apparatus
Kim et al. Speaker-adaptive lip reading with user-dependent padding
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN113299295B (en) Training method and device for voiceprint coding network
Матиченко et al. The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space
Nijhawan et al. Speaker recognition using support vector machine
Mohammadi et al. Weighted X-vectors for robust text-independent speaker verification with multiple enrollment utterances
Arora et al. An efficient text-independent speaker verification for short utterance data from Mobile devices
Zi et al. Joint filter combination-based central difference feature extraction and attention-enhanced Dense-Res2Block network for short-utterance speaker recognition
Sas et al. Gender recognition using neural networks and ASR techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant