CN111145760A - Method and neural network model for speaker recognition

Method and neural network model for speaker recognition

Info

Publication number
CN111145760A
Authority
CN
China
Prior art keywords
vector
attention
ith
vectors
coding
Prior art date
Legal status
Granted
Application number
CN202010256078.5A
Other languages
Chinese (zh)
Other versions
CN111145760B (en)
Inventor
王志铭
姚开盛
李小龙
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010256078.5A
Publication of CN111145760A
Application granted
Publication of CN111145760B
Legal status: Active
Anticipated expiration

Classifications

    • G10L 17/02 — Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 — Speaker identification or verification: training, enrolment or model building
    • G10L 17/18 — Speaker identification or verification: artificial neural networks; connectionist approaches
    • G10L 25/24 — Speech or voice analysis techniques: extracted parameters being the cepstrum
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06N 3/048 — Neural network architectures: activation functions
    • G06N 3/08 — Neural networks: learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of the present specification provide a method and a neural network model for speaker recognition. According to the method, the spectral features of a speaker audio segment are first obtained; the spectral features are then encoded into a vector sequence formed by N frame-level coding vectors. Next, K pooling processes are applied to the vector sequence to obtain K corresponding sub-embedded vectors; the i-th pooling process includes, for each of the N coding vectors, determining its attention coefficient based on an i-th attention algorithm corresponding to that pooling process, and summing the coding vectors using their attention coefficients as weighting factors. A total embedding vector is then determined based on the K sub-embedded vectors, and speaker recognition is performed based on the total embedding vector.

Description

Method and neural network model for speaker recognition
Technical Field
One or more embodiments of the present specification relate to the field of machine learning, and more particularly, to methods and neural network models for speaker recognition.
Background
Voiceprints are acoustic features extracted from the spectral characteristics of a speaker's sound waves. Like a fingerprint, a voiceprint, as a biometric, reflects the individual characteristics and identity of a speaker. Voiceprint recognition, also known as speaker recognition, is a biometric authentication technique that uses the speaker-specific information contained in a speech signal to automatically identify the speaker. This biometric authentication technique has broad application prospects in fields and scenarios such as identity authentication and security verification.
Speaker recognition systems and neural network models have been proposed for authentication and verification, which generally extract feature vectors from speaker audio that express the characteristics of the speaker's voice, and perform speaker recognition based on the feature vectors. However, the recognition accuracy of the existing schemes still needs to be improved.
It is desirable to have an improved scheme that can more effectively acquire feature vectors reflecting the speaking characteristics of a speaker, thereby further improving the accuracy of speaker recognition.
Disclosure of Invention
One or more embodiments of the present specification describe a method and a neural network model for speaker recognition, in which a global multi-head attention mechanism is used for vector aggregation, so as to better capture the speech-segment-level characteristics of the speaker and improve the accuracy of speaker recognition.
According to a first aspect, there is provided a method of speaker recognition, comprising:
acquiring the spectral features of a speaker audio segment;
encoding the spectral features to obtain a vector sequence formed by N frame-level coding vectors;
respectively applying K pooling processes to the vector sequence to obtain K corresponding sub-embedded vectors; wherein an ith pooling process among the K pooling processes includes, for any first coding vector among the N coding vectors, determining an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the coding vectors using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
determining a total embedding vector based on the K sub-embedding vectors;
and carrying out speaker identification based on the total embedded vector.
In different embodiments, the spectral features may comprise Mel-frequency cepstral coefficient (MFCC) features or Mel-scale filter bank (FBank) features.
In one embodiment, a fully connected feedforward neural network is used to encode the spectral features to obtain the N code vectors. In another embodiment, the vector sequence is obtained by performing convolution processing on the spectral features by using a plurality of convolution kernels.
In a specific embodiment, the ith pooling process specifically includes obtaining a self-attention score of the first coding vector from the dot product of the ith attention vector corresponding to the ith pooling process and the first coding vector; and determining the attention coefficient of the first coding vector according to the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In another embodiment, the ith pooling process specifically includes sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector; obtaining the self-attention score of the first coding vector from the dot product of the ith attention vector and the first transformed vector; and determining the attention coefficient of the first coding vector according to the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In one embodiment, the above method is performed using a neural network model, the ith attention vector being determined by training the neural network model.
In a further particular embodiment, the ith pooling process particularly comprises determining a self-attention score for the first coded vector according to a self-attention scoring function; dividing the self-attention score by an ith precision coefficient preset for the ith pooling processing to obtain an adjustment score; according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
In a specific embodiment, the total embedded vector is obtained by concatenating the K sub-embedded vectors.
According to a second aspect, there is provided a neural network model for speaker recognition, comprising:
the input layer is used for acquiring the frequency spectrum characteristics of the audio segments of the speaker;
the coding layer is used for coding the frequency spectrum characteristics to obtain a vector sequence consisting of N coding vectors at a frame level;
the pooling layer comprises K pooling units and is used for respectively applying K pooling processes to the vector sequence to obtain K corresponding sub-embedded vectors; any ith pooling unit among the K pooling units is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling unit, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion layer for determining a total embedding vector based on the K sub-embedding vectors;
and the classification layer is used for carrying out speaker identification based on the total embedded vector.
According to a third aspect, there is provided an apparatus for speaker recognition, comprising:
the input module is configured to acquire the frequency spectrum characteristics of the speaker audio fragment;
the coding module is configured to code the spectrum characteristics to obtain a vector sequence formed by N coding vectors at a frame level;
the pooling module comprises K pooling submodules and is configured to apply K pooling processes to the vector sequence respectively to obtain K corresponding sub-embedded vectors; any ith pooling submodule among the K pooling submodules is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion module configured to determine a total embedded vector based on the K sub-embedded vectors;
and the classification module is configured to perform speaker recognition based on the total embedded vector.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first or second aspect.
According to the method and the neural network model provided by the embodiments of this specification, after the frame-level coding vectors are obtained through encoding, they are aggregated into an embedded vector using an attention mechanism in a global multi-head attention pooling manner. In a further embodiment, the precision, or resolution, of each attention head can also be set, forming a multi-resolution multi-attention-head pooling scheme. The total embedded vector obtained in this way reflects the speech-segment-level characteristics of the speaker more comprehensively, and speaker recognition based on it can therefore achieve higher recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a schematic structural diagram of a neural network model for speaker recognition, according to one embodiment of the present description;
FIG. 3 shows graphs of the attention coefficient under different precision coefficients;
FIG. 4 illustrates a flow diagram of a method of speaker recognition, according to one embodiment;
FIG. 5 shows a schematic block diagram of an apparatus for speaker recognition, according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. First, a speaker speaks, forming speaker audio. The speaker audio is input into a spectrum extraction unit, which extracts the basic spectral features from it. These spectral features are input into a neural network model. In the embodiment of Fig. 1, the neural network model first processes the input spectral features to obtain a sequence of frame-level coding vectors, i.e., N coding vectors corresponding to the frames of the audio. The sequence of frame-level coding vectors is then aggregated into an embedded vector that expresses the utterance characteristics at the speech segment (utterance) level. The neural network model then predicts a speaker identification (id) based on the embedded vector, thereby performing speaker recognition.
As mentioned above, the feature vector on which speaker recognition is based, i.e. the above embedded vector in the embodiments of the present specification, is crucial to the accuracy of voiceprint recognition. It is expected that the embedded vector can fully reflect the speaking voice characteristics of the speaker, so that higher recognition accuracy can be obtained. In the embodiment of the present specification, an attention mechanism is utilized, and a global multi-head attention pooling manner is adopted to aggregate N frame-level coding vectors into one embedded vector, so that the embedded vector comprehensively reflects the characteristics of the speaker speech segment level. The process of encoding, pooling, and identifying the audio spectrum is described below in conjunction with the structure of the neural network model.
FIG. 2 illustrates a schematic structural diagram of a neural network model for speaker recognition, according to one embodiment of the present specification. It will be appreciated that the neural network model may be implemented in any suitable programming language. As shown in FIG. 2, the neural network model can be implemented as a deep neural network as a whole, which includes at least an input layer 21, a coding layer 22, a pooling layer 23, a fusion layer 24, and a classification layer 25. The input layer 21 is used to obtain the spectral features of the speaker audio segment and pass them to the coding layer 22; the coding layer 22 encodes the spectral features to obtain N frame-level coding vectors h_1, h_2, …, h_N; the pooling layer 23 comprises K pooling units, each of which applies an attention-based pooling process to the N coding vectors, resulting in a sub-embedded vector, so that the pooling layer 23 outputs K sub-embedded vectors e_1, e_2, …, e_K; the fusion layer 24 fuses the K sub-embedded vectors to obtain a total embedding vector E; finally, the classification layer 25 performs classification prediction based on the embedding vector E and outputs the speaker recognition result.
The processing of the above layers is described in detail below.
First, the input layer 21 obtains the spectral features of the speaker audio segment. In one embodiment, the speaker audio segment is input into a spectrum extraction unit external to the model, which extracts the basic spectral features from the speaker audio and then feeds them into the input layer 21 of the neural network model. In another embodiment, the input layer 21 itself is designed with a spectrum extraction function; in that case, the speaker audio may be input directly into the input layer, and the spectral features are obtained by feature extraction in the input layer.
In one embodiment, the spectral feature is the Mel-frequency cepstral coefficient (MFCC) feature. The Mel frequency scale is derived from the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz. Extracting MFCC features from speaker audio typically includes the following steps: pre-emphasis, framing, windowing, Fourier transform, Mel filter bank, and discrete cosine transform (DCT). Pre-emphasis boosts the high-frequency part to some extent so that the spectrum of the signal becomes flatter; framing divides the speech into a series of frames in time; windowing applies a window function to improve the continuity at the left and right ends of each frame. The audio is then Fourier-transformed, converting the time-domain signal into a frequency-domain signal. Next, the frequencies of the frequency-domain signal are mapped onto the Mel scale using the Mel filter bank, yielding the Mel spectrum. Finally, the cepstral coefficients of the Mel spectrum are obtained through the discrete cosine transform, giving the MFCC features.
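As a concrete illustration of this pipeline, the following sketch extracts MFCC and FBank features with the librosa library. It is only a minimal example under assumed settings: the 16 kHz sampling rate, the 25 ms/10 ms framing, and the 40 Mel bands and 40 cepstral coefficients are illustrative choices, not values prescribed by the patent.

```python
# Minimal sketch of MFCC / FBank extraction; sampling rate, frame sizes and
# coefficient counts are illustrative assumptions, not values from the patent.
import librosa

def extract_spectral_features(wav_path, sr=16000, n_mels=40, n_mfcc=40):
    y, _ = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y)                     # boost the high-frequency part
    # FBank: power spectrum mapped onto the Mel scale (no DCT yet).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)                       # (n_mels, N frames), log-Mel
    # MFCC: discrete cosine transform applied on top of the log-Mel (FBank) spectrum.
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)    # (n_mfcc, N frames)
    return mfcc.T, fbank.T                                 # each (N frames, feature_dim)
```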
In another embodiment, the spectral feature uses the Mel-scale filter bank (FBank) feature. The FBank feature is the spectral feature obtained by mapping the frequencies of the frequency-domain signal onto the Mel scale using the Mel filter bank. In other words, the MFCC feature is obtained by applying a further discrete cosine transform on top of the FBank feature; the FBank feature is what the MFCC feature is before the discrete cosine transform.
In yet another embodiment, the spectral feature may be a linear predictive coding (LPC) feature or a perceptual linear prediction (PLP) feature. These features can be extracted by conventional methods. Other spectral features may also be extracted as the basis for the neural network model's processing, which is not specifically limited here.
Based on the spectral features obtained above, the coding layer 22 encodes them to obtain a vector sequence H formed by frame-level coding vectors, which can be written as H = [h_1, h_2, …, h_N], where h_t corresponds to the spectral feature of the t-th frame. Since the coding layer 22 is an intermediate layer of the neural network model, the coding vectors h_t are also referred to as hidden vectors.
The encoding layer 22 may derive the frame-level encoding vectors in a variety of ways.
In one embodiment, the encoding layer 22 is implemented as a multi-layer perceptron, and the spectral features of each frame are processed layer by layer in a fully connected feedforward manner to obtain the coding vector h_t corresponding to each frame.
In another embodiment, the encoding layer 22 processes the spectral features by means of convolution operations. Specifically, in one example, the coding layer 22 may be implemented as a multi-layer convolutional residual network comprising a plurality of convolutional layers, each with a corresponding convolution kernel used to convolve the spectral features. The convolution kernels of the different layers may have the same or different sizes; for example, in one example, the first two convolutional layers each use a 3 x 1 convolution kernel, and the following layers each use a 3 x 3 convolution kernel. The convolution kernels of different layers have different convolution parameters. Through such multi-layer convolution operations, the coding vector h_t corresponding to the spectral feature of each frame is obtained.
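As an illustration of the convolutional variant of the coding layer described above, the following PyTorch sketch maps a spectral feature sequence to N frame-level coding vectors. The framework, layer widths and kernel layout are assumptions made for illustration; in particular, this sketch omits the residual connections mentioned above and is not the patent's exact network.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Sketch of a coding layer: spectral features -> frame-level coding vectors h_1..h_N."""
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        # Treat the (frames x features) spectrogram as a one-channel image;
        # two 3x1 kernels along the time axis first, then a 3x3 kernel.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=(3, 3), padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * feat_dim, hidden_dim)

    def forward(self, spec):                        # spec: (batch, N, feat_dim)
        x = self.conv(spec.unsqueeze(1))            # (batch, 64, N, feat_dim)
        x = x.permute(0, 2, 1, 3).flatten(2)        # (batch, N, 64 * feat_dim)
        return self.proj(x)                         # (batch, N, hidden_dim): coding vectors h_t
```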
After the N coding vectors are obtained in any of the above ways, the coding layer 22 outputs the vector sequence H formed by the N coding vectors to the pooling layer 23, which pools and aggregates the N coding vectors.
As shown in fig. 2, the pooling layer 23 includes K pooling units, each of which determines an attention coefficient for each coding vector based on an attention mechanism and aggregates the vector sequence into a corresponding sub-embedded vector based on these attention coefficients. A pooling unit may therefore also be referred to as an attention head. In this manner, the pooling layer 23 obtains multiple sub-embedded vectors using multiple attention heads. The aggregation process of any one pooling unit, the i-th pooling unit, is described below.
For any coding vector h_t among the N coding vectors, the i-th pooling unit determines the attention coefficient α_t of that coding vector according to the i-th attention algorithm corresponding to this pooling unit, and then sums the N coding vectors using their respective attention coefficients as weighting factors to obtain the corresponding sub-embedded vector. The i-th attention algorithm corresponding to the i-th pooling unit may include: scoring the coding vector h_t with a self-attention scoring function based on the corresponding i-th attention vector to obtain the self-attention score of h_t; and determining the attention coefficient of h_t based on the self-attention score, such that the attention coefficient is positively correlated with the self-attention score. The i-th attention vector v^{(i)} corresponding to the i-th pooling unit may be preset as a hyper-parameter, but more generally and preferably, the i-th attention vector v^{(i)} is determined by training the neural network model.
In one example, the self-attention scoring function described above may be expressed as:

$$ s_t^{(i)} = \left(v^{(i)}\right)^{\top} h_t \tag{1} $$

where the superscript (i) indicates correspondence to the i-th pooling unit, and (v^{(i)})^T denotes the transpose of v^{(i)}. Equation (1) therefore amounts to taking the dot product of the i-th attention vector v^{(i)} and the coding vector h_t to obtain the self-attention score of h_t.
In other examples, more complex transformations may be applied on this basis to obtain the self-attention score. In one example, a linear transformation and a nonlinear activation function may first be applied to the coding vector h_t in sequence to obtain a transformed vector; the self-attention score is then obtained from the dot product of the i-th attention vector v^{(i)} and the transformed vector. Specifically, in one example, the self-attention scoring function may be expressed as:

$$ s_t^{(i)} = \left(v^{(i)}\right)^{\top} f\!\left(W^{(i)} h_t + b^{(i)}\right) + k^{(i)} \tag{2} $$

where W^{(i)} is the matrix used for the linear transformation, f is a nonlinear activation function such as the sigmoid function, b^{(i)} is an optional bias vector, and k^{(i)} is an optional bias parameter. These parameters are determined through the training process of the model.
Other self-attention scoring functions, not enumerated herein, may also be derived with minor modifications based on the above equations, such as adding coefficients, adding or subtracting bias terms, and the like.
After the self-attention score of the coding vector h_t has been determined, the attention coefficient of h_t can be determined from that score. The attention coefficient is used as the subsequent weighting factor, measuring the importance of the corresponding coding vector in this aggregation.
According to embodiments of the present specification, the attention coefficient α_t of the coding vector h_t is determined so as to be positively correlated with the self-attention score s_t^{(i)} of h_t. Specifically, in one example, the self-attention score computed above may be used directly as the attention coefficient. In another example, the self-attention scores of the N coding vectors are normalized, and the normalized ratio is taken as the attention coefficient.
For example, in one example, using scale normalization, the attention coefficient α_t of the coding vector h_t can be expressed as:

$$ \alpha_t^{(i)} = \frac{s_t^{(i)}}{\sum_{j=1}^{N} s_j^{(i)}} \tag{3} $$

In another example, normalization is performed using the softmax function, and the attention coefficient α_t of the coding vector h_t can be expressed as:

$$ \alpha_t^{(i)} = \frac{\exp\!\left(s_t^{(i)}\right)}{\sum_{j=1}^{N} \exp\!\left(s_j^{(i)}\right)} \tag{4} $$

With the attention coefficient of each coding vector determined, the coding vectors are summed using their attention coefficients as weighting factors, yielding the sub-embedded vector e_i corresponding to the i-th pooling unit:

$$ e_i = \sum_{t=1}^{N} \alpha_t^{(i)} h_t \tag{5} $$

Assuming each coding vector is a d-dimensional vector, the resulting sub-embedded vector is also a d-dimensional vector.
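A single pooling unit of this kind can be sketched in a few lines of PyTorch. The sketch below combines the scoring of equation (2) (with tanh assumed as the activation f and the optional scalar bias omitted), the softmax normalization of equation (4), and the weighted sum of equation (5); the tensor shapes and hidden dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One pooling unit: scores each frame-level coding vector, then aggregates them."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim)       # linear transformation W (with bias b)
        self.v = nn.Parameter(torch.randn(hidden_dim))   # attention vector v^(i), learned in training

    def forward(self, H):                                # H: (batch, N, hidden_dim) coding vectors
        scores = torch.tanh(self.W(H)) @ self.v          # eq. (2): s_t = v^T f(W h_t + b) -> (batch, N)
        alpha = torch.softmax(scores, dim=-1)            # eq. (4): attention coefficients
        e = torch.einsum('bn,bnd->bd', alpha, H)         # eq. (5): weighted sum -> sub-embedded vector
        return e, alpha
```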
It can be understood that each pooling unit (i.e. attention head) assigns different attention coefficients to each coded vector in the above manner (only the attention vectors in each pooling unit are different), and then aggregates each coded vector according to the attention coefficients to obtain corresponding sub-embedded vectors, so that K pooling units obtain K sub-embedded vectors respectively. Therefore, the pooling layer 23 uses a multi-head attention method to obtain a plurality of sub-embedding vectors through a plurality of attention-based aggregation methods.
In contrast to some techniques that take a segment of coding vectors as the unit of analysis, the multiple attention heads in the embodiments of this specification determine an attention coefficient for each frame-level coding vector and then perform weighted aggregation over all coding vectors, thereby taking the speech information of all frames into account; this may be referred to as a global multi-head attention approach. In this way, speaker characteristic information at the speech segment level can be captured better.
Further, in one embodiment, on the basis of multi-head attention, a different precision coefficient T_i is set for each attention head to adjust the precision, or "resolution", of its attention coefficients. Borrowing terminology from knowledge distillation and annealing, this coefficient may be referred to as a temperature coefficient. The temperature coefficient T_i scales the self-attention score in inverse proportion and is introduced into the determination of the attention coefficient, thereby adjusting the resolution of the attention coefficient.
Specifically, in one embodiment, the self-attention score is divided by the i-th precision coefficient T_i preset for the i-th pooling unit to obtain an adjusted score; the attention coefficient of the coding vector is then determined from the adjusted score.
When the attention coefficient is determined using equation (4), the attention coefficient corrected by the temperature coefficient can be expressed as:

$$ \alpha_t^{(i)} = \frac{\exp\!\left(s_t^{(i)} / T_i\right)}{\sum_{j=1}^{N} \exp\!\left(s_j^{(i)} / T_i\right)} \tag{6} $$
Fig. 3 shows how the attention coefficient varies under different precision coefficients. In the leftmost plot of fig. 3, the temperature coefficient is T = 1; the attention coefficient curve transitions very steeply, and the attention coefficient is very sensitive to changes in the self-attention score. When the temperature coefficient rises to 20, the curve becomes gentler, as shown in the middle plot. When the temperature coefficient rises to 30, as shown in the rightmost plot, the curve changes even more gradually. The flatter the curve, the less sensitive the attention coefficient is to changes in the self-attention score, i.e., the lower the precision or resolution of the attention coefficient. It can be seen that as the temperature coefficient increases, the resolution of the attention coefficient decreases; when the temperature coefficient approaches infinity, the attention coefficient becomes a constant (1/N) that no longer changes with the self-attention score, and the resolution of the attention coefficient is 0.
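This flattening effect can also be checked numerically with a toy example; the scores below are made up purely for illustration:

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.5, 0.1])   # made-up self-attention scores for 4 frames
for T in (1, 20, 30):
    # As T grows, the coefficients flatten toward the constant 1/N.
    print(T, torch.softmax(scores / T, dim=0).tolist())
```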
In one embodiment, different temperature coefficients T may be preset for the respective pooling units, thereby forming a multi-resolution, multi-attention-head pooling layer 23. Each pooling unit in the pooling layer 23 performs attention aggregation on the N coding vectors using its corresponding temperature coefficient and attention algorithm to obtain the corresponding sub-embedded vector. The pooling layer 23 thus obtains K sub-embedded vectors through its K pooling units.
Next, the K sub-embedded vectors are input into the fusion layer 24, which fuses the K sub-embedded vectors into a total embedding vector E.
Specifically, in one embodiment, the fusion layer 24 concatenates the K sub-embedded vectors to obtain the total embedding vector E. Assuming the coding vectors and sub-embedded vectors have dimension d, the total embedding vector thus obtained has dimension K x d. In other examples, the fusion layer 24 may also use other fusion operations to obtain the total embedding vector E; for example, the K sub-embedded vectors may be summed or multiplied element-wise.
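Putting the pieces together, a pooling layer with K temperature-scaled attention heads followed by concatenation fusion might look like the sketch below. The head count, the per-head temperatures and the use of the simple dot-product scoring of equation (1) are illustrative assumptions; equation (6) governs the temperature scaling.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionPooling(nn.Module):
    """K pooling units with per-head temperature T_i, fused by concatenation."""
    def __init__(self, hidden_dim=256, temperatures=(1.0, 5.0, 10.0, 20.0)):
        super().__init__()
        self.temperatures = temperatures
        # One attention vector per head, each learned during training.
        self.v = nn.Parameter(torch.randn(len(temperatures), hidden_dim))

    def forward(self, H):                                    # H: (batch, N, hidden_dim)
        sub_embeddings = []
        for i, T in enumerate(self.temperatures):
            scores = H @ self.v[i]                           # eq. (1): s_t = v_i^T h_t -> (batch, N)
            alpha = torch.softmax(scores / T, dim=-1)        # eq. (6): temperature-scaled coefficients
            sub_embeddings.append(torch.einsum('bn,bnd->bd', alpha, H))   # eq. (5)
        return torch.cat(sub_embeddings, dim=-1)             # total embedding E, dimension K * d
```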
The classification layer 25 may then perform speaker recognition based on the total embedded vector obtained as described above.
In one specific example, the classification layer 25 may include several fully connected sublayers that further process the total embedded vector, and an output layer that classifies the resulting vector, for example using a softmax function. The classification result may specifically be a speaker id, or a binary result indicating whether the audio belongs to a particular speaker. In this manner, the classification layer 25 outputs the result of speaker recognition.
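A classification layer along these lines might be sketched as follows; the number and width of the fully connected sublayers and the number of speakers are placeholders for illustration, not values from the patent.

```python
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """Fully connected sublayers over the total embedding E, followed by a softmax output."""
    def __init__(self, embed_dim, num_speakers, fc_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, num_speakers),                # logits over speaker ids
        )

    def forward(self, E):                                   # E: (batch, K * d) total embedding
        return self.net(E).log_softmax(dim=-1)              # log-probabilities for speaker recognition
```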
Reviewing the structure and processing flow of the neural network model above, it can be seen that after the coding layer obtains the frame-level coding vectors, the pooling layer aggregates them into an embedded vector using an attention mechanism in a global multi-head attention pooling manner. In a further embodiment, the precision or resolution of each attention head may also be set, forming a multi-resolution multi-attention-head pooling layer. The total embedded vector obtained in this way reflects the speech-segment-level characteristics of the speaker more comprehensively, and speaker recognition based on it can therefore achieve higher recognition accuracy.
In another aspect, a method of speaker recognition is provided. FIG. 4 illustrates a flow diagram of a method of speaker recognition according to one embodiment. The method may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities. As shown in FIG. 4, the method includes the following steps.
In step 41, the spectral features of the speaker audio segment are obtained. In different embodiments, the spectral features may comprise Mel-frequency cepstral coefficient (MFCC) features or Mel-scale filter bank (FBank) features.
Then, in step 42, the spectral features are encoded to obtain a vector sequence of N encoded vectors at the frame level. In one embodiment, a fully connected feedforward neural network is used to encode the spectral features to obtain the N code vectors. In another embodiment, the vector sequence is obtained by performing convolution processing on the spectral features by using a plurality of convolution kernels.
In step 43, K pooling processes are applied to the vector sequence respectively to obtain K corresponding sub-embedded vectors; an ith pooling process among the K pooling processes includes, for any first coding vector among the N coding vectors, determining an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the coding vectors using the attention coefficients of the coding vectors as weighting factors; K is an integer greater than 1.
In a specific embodiment, the ith pooling process specifically includes obtaining the self-attention score of the first coding vector from the dot product of the ith attention vector corresponding to the ith pooling process and the first coding vector; and determining the attention coefficient of the first coding vector according to the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In another embodiment, the ith pooling process specifically includes sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector; obtaining the self-attention score of the first coding vector from the dot product of the ith attention vector corresponding to the ith pooling process and the first transformed vector; and determining the attention coefficient of the first coding vector according to the self-attention score, such that the attention coefficient is positively correlated with the self-attention score.
In one embodiment, the above method is performed using a neural network model, the ith attention vector being determined by training the neural network model.
In a further particular embodiment, the ith pooling process particularly comprises determining a self-attention score of the first encoding vector according to a self-attention scoring function; dividing the self-attention score by an ith precision coefficient preset for the ith pooling processing to obtain an adjustment score; according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
After the K sub-embedded vectors are obtained through the K pooling processes, a total embedding vector is determined based on the K sub-embedded vectors in step 44. In a specific embodiment, the total embedding vector is obtained by concatenating the K sub-embedded vectors.
Then, in step 45, speaker recognition is performed based on the total embedding vector.
With this method, a multi-head attention mechanism is adopted for pooling, a more effective embedded vector representation is obtained, and the accuracy of speaker recognition is thereby improved.
According to an embodiment of yet another aspect, an apparatus for speaker recognition is provided that may be implemented as any device, platform, or cluster of devices having data storage, computing, processing capabilities. FIG. 5 shows a schematic block diagram of an apparatus for speaker recognition, according to one embodiment. As shown in fig. 5, the speaker recognition apparatus 500 includes:
an input module 51 configured to obtain spectral characteristics of an audio segment of a speaker;
the encoding module 52 is configured to encode the spectrum features to obtain a vector sequence formed by N encoding vectors at a frame level;
the pooling module 53 comprises K pooling submodules and is configured to apply K pooling processes to the vector sequence respectively to obtain K corresponding sub-embedded vectors; any ith pooling submodule among the K pooling submodules is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion module 54 configured to determine a total embedded vector based on the K sub-embedded vectors;
a classification module 55 configured to perform speaker recognition based on the total embedded vector.
Through the device, efficient and accurate speaker recognition is achieved.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 4.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 4.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (18)

1. A method of speaker recognition, comprising:
acquiring the frequency spectrum characteristics of the audio segments of the speaker;
coding the frequency spectrum characteristics to obtain N coding vectors at a frame level, wherein the N coding vectors form a vector sequence;
respectively applying K pooling processes to the vector sequence to obtain K corresponding sub-embedded vectors; wherein an ith pooling process among the K pooling processes includes, for any first coding vector among the N coding vectors, determining an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling process, and summing the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
determining a total embedding vector based on the K sub-embedding vectors;
and carrying out speaker identification based on the total embedded vector.
2. The method of claim 1, wherein the spectral features comprise Mel-frequency cepstral coefficient (MFCC) features, or Mel-scale filter bank (FBank) features.
3. The method of claim 1, wherein encoding the spectral feature to obtain a vector sequence of N encoded vectors comprises:
and carrying out convolution processing on the frequency spectrum characteristics by utilizing a plurality of convolution kernels to obtain the vector sequence.
4. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling process and the first coding vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
5. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector;
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling process and the first transformed vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
6. The method of claim 4 or 5, wherein the method is performed by a neural network model, the i-th attention vector being determined by training the neural network model.
7. The method of claim 1, wherein determining the attention coefficient of the first coded vector based on an ith attention algorithm corresponding to the ith pooling process comprises:
determining a self-attention score for the first encoding vector according to a self-attention scoring function;
dividing the self-attention score by an ith precision coefficient preset for the ith pooling processing to obtain an adjustment score;
according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
8. The method of claim 1, wherein determining a total embedding vector based on the K sub-embedding vectors comprises:
and splicing the K sub-embedded vectors to obtain the total embedded vector.
9. A neural network model for speaker recognition, comprising:
the input layer is used for acquiring the frequency spectrum characteristics of the audio segments of the speaker;
the coding layer is used for coding the frequency spectrum characteristics to obtain N coding vectors at a frame level, and the N coding vectors form a vector sequence;
the pooling layer comprises K pooling units and is used for respectively applying K pooling processes to the vector sequence to obtain K corresponding sub-embedded vectors; any ith pooling unit among the K pooling units is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling unit, and sum the coding vectors by using the attention coefficients of the coding vectors as weighting factors; wherein K is an integer greater than 1;
a fusion layer for determining a total embedding vector based on the K sub-embedding vectors;
and the classification layer is used for carrying out speaker identification based on the total embedded vector.
10. The neural network model of claim 9, wherein the spectral features comprise Mel-frequency cepstral coefficient (MFCC) features, or Mel-scale filter bank (FBank) features.
11. The neural network model of claim 9, wherein the coding layers comprise a plurality of convolutional layers that convolve the spectral features with a plurality of convolution kernels, resulting in the sequence of vectors.
12. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling unit and the first coding vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
13. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
sequentially applying a linear transformation and a nonlinear activation function to the first coding vector to obtain a first transformed vector;
obtaining the self-attention score of the first coding vector according to the dot product of the ith attention vector corresponding to the ith pooling unit and the first transformed vector;
according to the self-attention score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the self-attention score.
14. The neural network model of claim 12 or 13, wherein the ith attention vector is determined by training the neural network model.
15. The neural network model of claim 9, wherein the ith pooling unit is specifically configured to:
determining a self-attention score for the first encoding vector according to a self-attention scoring function;
dividing the self-attention score by an ith precision coefficient preset for the ith pooling unit to obtain an adjustment score;
according to the adjustment score, an attention coefficient of a first encoding vector is determined such that the attention coefficient is positively correlated with the adjustment score.
16. The neural network model of claim 9, wherein the fusion layer is specifically configured to:
and splicing the K sub-embedded vectors to obtain the total embedded vector.
17. An apparatus for speaker recognition, comprising:
the input module is configured to acquire the frequency spectrum characteristics of the speaker audio fragment;
the coding module is configured to code the spectrum features to obtain N coding vectors at a frame level, and the N coding vectors form a vector sequence;
the pooling module comprises K pooling submodules and is configured to apply K pooling processes to the vector sequence respectively to obtain K corresponding sub-embedded vectors; any ith pooling submodule among the K pooling submodules is configured to, for any first coding vector among the N coding vectors, determine an attention coefficient of the first coding vector based on an ith attention algorithm corresponding to the ith pooling submodule, and sum the coding vectors by using the attention coefficients of the coding vectors as weight factors; wherein K is an integer greater than 1;
a fusion module configured to determine a total embedded vector based on the K sub-embedded vectors;
and the classification module is configured to perform speaker recognition based on the total embedded vector.
18. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-8.
CN202010256078.5A 2020-04-02 2020-04-02 Method and neural network model for speaker recognition Active CN111145760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010256078.5A CN111145760B (en) 2020-04-02 2020-04-02 Method and neural network model for speaker recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010256078.5A CN111145760B (en) 2020-04-02 2020-04-02 Method and neural network model for speaker recognition

Publications (2)

Publication Number Publication Date
CN111145760A 2020-05-12
CN111145760B 2020-06-30

Family

ID=70528742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010256078.5A Active CN111145760B (en) 2020-04-02 2020-04-02 Method and neural network model for speaker recognition

Country Status (1)

Country Link
CN (1) CN111145760B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof
CN112420057A (en) * 2020-10-26 2021-02-26 四川长虹电器股份有限公司 Voiceprint recognition method, device and equipment based on distance coding and storage medium
CN112634880A (en) * 2020-12-22 2021-04-09 北京百度网讯科技有限公司 Speaker identification method, device, equipment, storage medium and program product
CN113299295A (en) * 2021-05-11 2021-08-24 支付宝(杭州)信息技术有限公司 Training method and device for voiceprint coding network
CN113658355A (en) * 2021-08-09 2021-11-16 燕山大学 Deep learning-based authentication identification method and intelligent air lock
CN116072125A (en) * 2023-04-07 2023-05-05 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment
US11676609B2 (en) 2020-07-06 2023-06-13 Beijing Century Tal Education Technology Co. Ltd. Speaker recognition method, electronic device, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN109241536A (en) * 2018-09-21 2019-01-18 浙江大学 It is a kind of based on deep learning from the sentence sort method of attention mechanism
US20190139541A1 (en) * 2017-11-08 2019-05-09 International Business Machines Corporation Sensor Fusion Model to Enhance Machine Conversational Awareness
CN109801635A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of vocal print feature extracting method and device based on attention mechanism
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110334339A (en) * 2019-04-30 2019-10-15 华中科技大学 It is a kind of based on location aware from the sequence labelling model and mask method of attention mechanism
US20200043508A1 (en) * 2018-08-02 2020-02-06 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for a triplet network with attention for speaker diarization

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
US20190139541A1 (en) * 2017-11-08 2019-05-09 International Business Machines Corporation Sensor Fusion Model to Enhance Machine Conversational Awareness
US20200043508A1 (en) * 2018-08-02 2020-02-06 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for a triplet network with attention for speaker diarization
CN109241536A (en) * 2018-09-21 2019-01-18 浙江大学 It is a kind of based on deep learning from the sentence sort method of attention mechanism
CN109801635A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of vocal print feature extracting method and device based on attention mechanism
CN110334339A (en) * 2019-04-30 2019-10-15 华中科技大学 It is a kind of based on location aware from the sequence labelling model and mask method of attention mechanism
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LI LANXIN: "Research on Attention Mechanisms for Natural Language Processing", China Master's Theses Full-text Database, Information Science and Technology *
CAI GUODU: "Research on Speaker Recognition Based on x-vector", China Master's Theses Full-text Database, Information Science and Technology *
GUO JIA ET AL.: "Multi-step Network Traffic Prediction Based on a Full Attention Mechanism", Journal of Signal Processing *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11676609B2 (en) 2020-07-06 2023-06-13 Beijing Century Tal Education Technology Co. Ltd. Speaker recognition method, electronic device, and storage medium
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof
CN111833886B (en) * 2020-07-27 2021-03-23 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof
CN112420057A (en) * 2020-10-26 2021-02-26 四川长虹电器股份有限公司 Voiceprint recognition method, device and equipment based on distance coding and storage medium
CN112634880A (en) * 2020-12-22 2021-04-09 北京百度网讯科技有限公司 Speaker identification method, device, equipment, storage medium and program product
CN113299295A (en) * 2021-05-11 2021-08-24 支付宝(杭州)信息技术有限公司 Training method and device for voiceprint coding network
CN113299295B (en) * 2021-05-11 2022-12-30 支付宝(杭州)信息技术有限公司 Training method and device for voiceprint coding network
CN113658355A (en) * 2021-08-09 2021-11-16 燕山大学 Deep learning-based authentication identification method and intelligent air lock
CN116072125A (en) * 2023-04-07 2023-05-05 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment
CN116072125B (en) * 2023-04-07 2023-10-17 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment

Also Published As

Publication number Publication date
CN111145760B (en) 2020-06-30

Similar Documents

Publication Publication Date Title
CN111145760B (en) Method and neural network model for speaker recognition
CN108447490B (en) Voiceprint recognition method and device based on memorability bottleneck characteristics
Lokesh et al. An automatic tamil speech recognition system by using bidirectional recurrent neural network with self-organizing map
US11170788B2 (en) Speaker recognition
Sarangi et al. Optimization of data-driven filterbank for automatic speaker verification
CN108281146B (en) Short voice speaker identification method and device
CN111429948B (en) Voice emotion recognition model and method based on attention convolution neural network
WO2019237519A1 (en) General vector training method, voice clustering method, apparatus, device and medium
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
US20210217431A1 (en) Voice morphing apparatus having adjustable parameters
Panchapagesan et al. Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC
Ajmera et al. Fractional Fourier transform based features for speaker recognition using support vector machine
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
US20210193159A1 (en) Training a voice morphing apparatus
Kim et al. Speaker-adaptive lip reading with user-dependent padding
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN113299295B (en) Training method and device for voiceprint coding network
Матиченко et al. The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space
Nijhawan et al. Speaker recognition using support vector machine
Mohammadi et al. Weighted X-vectors for robust text-independent speaker verification with multiple enrollment utterances
Arora et al. An efficient text-independent speaker verification for short utterance data from Mobile devices
Zi et al. Joint filter combination-based central difference feature extraction and attention-enhanced Dense-Res2Block network for short-utterance speaker recognition
Sas et al. Gender recognition using neural networks and ASR techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant