CN115083419A - Speaker recognition method and device, equipment and storage medium - Google Patents

Speaker recognition method and device, equipment and storage medium

Info

Publication number
CN115083419A
CN115083419A
Authority
CN
China
Prior art keywords
speaker
feature
vector
audio
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110281559.6A
Other languages
Chinese (zh)
Inventor
宫帅
童颖
丁国宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd filed Critical Jingdong Technology Holding Co Ltd
Priority to CN202110281559.6A priority Critical patent/CN115083419A/en
Publication of CN115083419A publication Critical patent/CN115083419A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032 Quantisation or dequantisation of spectral components
    • G10L19/038 Vector quantisation, e.g. TwinVQ audio
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a speaker identification method, which comprises the following steps: acquiring audio data output by a speaker to be identified; performing framing and feature extraction on the audio data by using a trained speaker recognition model to obtain a feature vector of the audio data; and performing optimization and dimension reduction on the feature vector by using the speaker recognition model to obtain an identity vector of the speaker to be recognized, wherein the identity vector is used for distinguishing the identity of the speaker to be identified. In this way, feature extraction from the audio data to be identified and identification of the speaker's identity are realized in an end-to-end manner, the frequency resolution of different frequency bands is no longer fixed by prior information, and the robustness of the recognition system is improved. The embodiment of the application also provides a speaker recognition device, equipment and a storage medium.

Description

Speaker recognition method and device, equipment and storage medium
Technical Field
The present application relates to the field of electronic device technology, and relates to, but is not limited to, speaker recognition methods and apparatuses, devices, and storage media.
Background
Speaker Recognition (SRE) is a technique for automatically recognizing the identity of a speaker based on information in a speech signal. Depending on the task, speaker recognition can be further classified into speaker identification (Speaker Identification), speaker verification (Speaker Verification), speaker tracking (Speaker Segmentation and Clustering), and the like. Because it readily supports man-machine interaction, remote identity verification and attack prevention, speaker recognition technology is widely applied in fields such as the military, government and finance.
In the existing speaker recognition system, the training of the X-vector-based embedding-layer (embedding) model mainly includes the following parts. A feature extraction part: the time-domain waveform of the audio is processed with a short-time Fourier Transform (STFT) to obtain features such as Mel Frequency Cepstral Coefficients (MFCC), filter bank features (Filter Bank, FBANK), pitch (PITCH), Linear Prediction Cepstrum Coefficients (LPCC) and Constant-Q Transform (CQT) features, which serve as the input of a neural network model. A model network part: the network structure of the model mainly comprises a frame-level layer, a pooling layer and a segment-level layer. At test time, the extracted frequency-domain features are used as the input of the neural network model, and a one-hot vector characterizing the identity information of the speaker is output.
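As a minimal illustration of this conventional STFT-based front end (the library, file name and window settings below are assumptions for illustration, not values taken from any concrete system), such features can be computed as follows:

```python
# Conventional STFT-based feature extraction (related art): a fixed window,
# hence a fixed frequency resolution for all bands. Assumes librosa is
# installed; "speaker.wav", the 25 ms window and 10 ms shift are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("speaker.wav", sr=16000)             # time-domain waveform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                            n_fft=400, hop_length=160)     # (20, num_frames)
fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                       n_fft=400, hop_length=160)
log_fbank = np.log(fbank + 1e-6)                           # FBANK features
# mfcc / log_fbank are then fed to the neural network model as input
```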
In the current method, the frequency-domain features are obtained by processing after a short-time Fourier transform; because the time-domain window length of the short-time Fourier transform is fixed, the corresponding frequency-domain resolution is also fixed. There is also a frequency-domain transformation method based on the constant-Q transform, which uses windows of variable length; since the low frequencies carry most of the energy while the high frequencies carry little, the spectrum after the constant-Q transform has higher frequency resolution at low frequencies and higher time resolution at high frequencies. However, this method is likewise based on certain prior information, and once the sampling rate is determined, the processing is the same across the different frequency bands.
Disclosure of Invention
The embodiment of the application provides a speaker identification method, a speaker identification device, equipment and a storage medium, which at least solve the problem that the robustness of a speaker identification system is poor because prior information is used to fix the frequency resolution of different frequency bands.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a speaker identification method, where the method includes:
acquiring audio data output by a speaker to be identified;
performing framing and feature extraction on the audio data by using a trained speaker recognition model to obtain a feature vector of the audio data;
optimizing and reducing dimensions of the feature vector by using the speaker recognition model to obtain an identity vector of the speaker to be recognized; and the identity vector is used for distinguishing the identity of the speaker to be identified.
In a second aspect, an embodiment of the present application provides a speaker recognition apparatus, including a sample obtaining module, a feature extraction module, and an optimization dimension reduction module, where:
the sample acquisition module is used for acquiring audio data output by the speaker to be identified;
the feature extraction module is used for performing framing and feature extraction on the audio data by using the trained speaker recognition model to obtain a feature vector of the audio data;
the optimization dimension reduction module is used for performing optimization dimension reduction on the feature vector by using the speaker recognition model to obtain an identity vector of the speaker to be recognized; and the identity vector is used for distinguishing the identity of the speaker to be identified.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the steps in the speaker recognition method when executing the program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the speaker recognition method described above.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
in the embodiment of the application, audio data output by a speaker to be identified is first obtained; then, framing and feature extraction are performed on the audio data by using the trained speaker recognition model to obtain a feature vector of the audio data; finally, optimization and dimension reduction are performed on the feature vector by using the speaker recognition model to obtain an identity vector of the speaker to be recognized, the identity vector being used for distinguishing the identity of the speaker to be identified. In this way, framing and feature extraction of the original audio data are completed by the pre-trained speaker recognition model, and the extracted feature vector is further optimized and reduced in dimension to obtain an identity vector representing the identity of the speaker to be recognized. Feature extraction from the audio data to be recognized and identification of the speaker's identity are thus realized in an end-to-end manner, which avoids the problem, found when the original audio data are processed with a short-time Fourier transform to obtain frequency-domain features, that prior information fixes the frequency resolution of different frequency bands and makes the recognition system less robust.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive efforts, wherein:
FIG. 1 is a schematic flowchart of a speaker recognition method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a speaker recognition method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a speaker recognition method according to an embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating a speaker recognition method according to an embodiment of the present disclosure;
FIG. 5A is a schematic diagram of a model structure of a speaker recognition method according to an embodiment of the present application;
FIG. 5B is a system diagram of a speaker recognition method according to an embodiment of the present application;
fig. 5C is a flowchart of a pre-training method based on data enhancement according to an embodiment of the present application;
FIG. 5D is a schematic diagram of a pre-training process provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a speaker recognition apparatus according to an embodiment of the present disclosure;
fig. 7 is a hardware entity diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first \ second \ third" referred to in the embodiments of the present application are only used for distinguishing similar objects and do not represent a specific ordering for the objects, and it should be understood that "first \ second \ third" may be interchanged under specific ordering or sequence if allowed, so that the embodiments of the present application described herein can be implemented in other orders than illustrated or described herein.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the present application belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Voice is the most direct and convenient way for human beings to communicate, and it attracts the attention of various research institutions with its advantages in various aspects such as convenience, economy and accuracy. The research of voice signal processing has great significance for promoting human-computer interaction and artificial intelligence development. For this reason, the related fields of speech signal processing, such as speech recognition, speech coding, speech synthesis, speaker recognition, etc., are receiving more and more attention and theoretical research.
Speaker recognition, also known as voiceprint recognition, aims at identity authentication based on each speaker's unique pronunciation. The voice of every speaker has unique individual characteristics because speakers' innate vocal organs differ and because each voice is further shaped by acquired environmental factors. Owing to these differences, the voice is taken as a biometric characteristic and as a recognition target, and speaker recognition has gradually formed a relatively complete set of recognition systems.
Speaker recognition and speech recognition are closely related technologies: both establish a corresponding reference template or model on the basis of characteristic parameters extracted from the original speech signal, and then recognize according to certain decision rules. In speech recognition, the speaking differences between different people are normalized as much as possible; in speaker recognition, the aim is to smooth out the semantic information in the speech signal, extract the individual characteristics of the speaker contained in the signal, and emphasize the feature differences between different speakers.
A speaker recognition system comprises a preprocessing part, a feature extraction part, and a model training and matching calculation part. The preprocessing and feature extraction part preprocesses the speech signal and extracts features, i.e., parameters capable of representing the characteristics of the speaker; the model training part comprises the establishment of the speaker model and the training of the model parameters; the matching calculation part performs matching calculation between the test speech and the speaker model. It can be seen that the key techniques of speaker recognition, including the feature parameter extraction algorithm, model selection and the model matching algorithm, directly determine the performance of the recognition system.
Speaker recognition models are divided into generative models and discriminative models. A generative model learns the characteristics of each category separately, i.e., one model per category, and the data to be recognized are mapped into each model to determine which category they belong to; a discriminative model learns a classification boundary that directly distinguishes which class different data belong to. The two kinds of models are represented, respectively, by the I-vector based on the Total Variability Model (TVM) and the X-vector based on the Time-Delay Neural Network (TDNN), which are the two most widely used vector models at present.
The most advanced speaker recognition system at present is developed based on the embedding (embedding) framework of X vectors. The X vector is a mainstream basic model frame in the current voiceprint recognition field, and can accept input with any length and convert the input into feature expression with fixed length by virtue of a statistical pooling layer in a network; in addition, a data enhancement strategy containing noise and reverberation is introduced in training, so that the model is more robust to interference such as the noise and the reverberation.
The X-vector network comprises several frame-level TDNN layers, a statistics pooling layer, two sentence-level fully-connected layers and a normalization layer (softmax); the loss function is the Cross-entropy Error (CE). The overall input of the TDNN is a segment of speech, and each TDNN layer takes a fixed number of frames. The pooling layer accumulates the output vectors of the TDNN layers and computes their mean and standard deviation as its output; that is, the mean and standard deviation of the TDNN outputs over all frames of the input audio sequence are taken, spliced together, and output as a sentence-level feature expression. The pooling layer is followed by two fully-connected layers and finally the normalization layer. The number of neurons output by the normalization layer equals the number of speakers in the training set, so the output can be seen as a posterior probability.
TDNN: structurally, each layer of the TDNN is still a Deep Neural Network (DNN), and only the input of each layer is spliced by historical, current and future features, so that the time sequence information is introduced; the TDNN architecture is advantageous in that it can parallelize training with respect to Long Short-Term Memory networks (LSTM), which in turn adds timing context information with respect to DNNs.
The speaker recognition system mainly comprises two stages: a training phase and a recognition phase. In the training stage, a template or model of each speaker is established through feature extraction from the training corpus of each speaker in the speaker set. In the recognition stage, the speech of the speaker to be recognized also undergoes feature extraction and is compared with the templates or models generated during system training. In speaker identification, the speaker whose model has the maximum similarity to the test speech is taken as the identification result; in speaker verification, the decision is made by determining whether the similarity between the test speech and the model of the claimed speaker exceeds a threshold.
The embodiment of the application provides a speaker identification method which is applied to electronic equipment. The electronic device includes, but is not limited to, a mobile phone, a laptop, a tablet and a web-enabled device, a multimedia device, a streaming media device, a mobile internet device, a wearable device, or other types of electronic devices. The functions implemented by the method can be implemented by calling program code by a processor in an electronic device, and the program code can be stored in a computer storage medium. The processor may be used for processing the process of speaker recognition and the memory may be used for storing data required and data generated during the process of speaker recognition.
Fig. 1 is a schematic flowchart of a speaker recognition method according to an embodiment of the present application, where as shown in fig. 1, the method at least includes the following steps:
step S110, acquiring audio data output by a speaker to be identified;
here, real-time audio data output by at least one speaker to be recognized through an acoustic path can be collected, and at least one section of stored audio data can be obtained from a stored corpus; wherein the audio acquisition may be collected by means of a recording.
Step S120, framing and feature extraction are carried out on the audio data by utilizing a trained speaker recognition model to obtain feature vectors of the audio data;
here, since the speech signal is sequence-changed due to the movement of the vocal organs, the acquired audio data needs to be sliced into a segment of 20 to 30 milliseconds (ms) of signal, each of which is referred to as a frame, assuming that the signal is continuously stable.
Speech feature extraction extracts the voice and vocal-tract characteristics of the speaker. The differences between speakers are mainly reflected in differences of their short-term speech spectra. Current speaker recognition systems transform the time-domain signal to the frequency domain using a short-time Fourier transform; because the time-domain window is fixed, all frequencies share the same frequency resolution. Moreover, the short-time spectral features extracted in this way are mainly used to derive parameters such as MFCC (Mel Frequency Cepstrum Coefficients) and LPCC (Linear Prediction Cepstrum Coefficients), which are single features, so the information characterizing the speaker's individuality is insufficient and the recognition accuracy suffers. In the present application, more information can be learned by extracting features with a neural network, and the finally extracted features are more abstract and have a stronger ability to characterize the speaker.
And S130, optimizing and reducing dimensions of the feature vector by using the speaker recognition model to obtain the identity vector of the speaker to be recognized.
Here, the process of performing the optimal dimension reduction on the feature vectors by using the speaker recognition model can be realized by an X-vector-based framework.
The process of optimization and dimension reduction can consist of, on the feature vectors extracted by the neural network, optimizing the regularities between different frames and the interrelations between features within a frame, mapping the feature vectors of different frames to vectors of fixed dimensionality, and finally connecting them into a sentence-level vector.
The identity vector is used for distinguishing the identity of the speaker to be recognized, for example, the identity vector is sent to a back-end scoring model, and the similarity between the identity vector and the registered speaker feature vector is determined so as to distinguish the identity of the speaker to be recognized. The mapping vector (embedding) output before the last normalization layer in the X-vector based framework is typically extracted for use as the identity vector for subsequent discrimination.
Illustratively, the similarity between the identity vector of the real-time audio data and the identity vector of the registered audio can be calculated through a general scoring method, such as Probabilistic Linear Discriminant Analysis (PLDA), and the judgment of the speaker recognition result can be performed according to the similarity and the set similarity threshold.
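As an illustrative sketch of such back-end scoring (plain cosine similarity is used here as a simpler stand-in for PLDA, and the threshold value is an assumption):

```python
# Compare the identity vector of the test audio with the enrolled speaker's
# vector and decide by a similarity threshold.
import numpy as np

def cosine_score(test_vec, enrolled_vec):
    return float(np.dot(test_vec, enrolled_vec) /
                 (np.linalg.norm(test_vec) * np.linalg.norm(enrolled_vec) + 1e-12))

def is_same_speaker(test_vec, enrolled_vec, threshold=0.7):
    return cosine_score(test_vec, enrolled_vec) >= threshold
```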
In the embodiment of the application, audio data output by a speaker to be identified is first obtained; then, framing and feature extraction are performed on the audio data by using the trained speaker recognition model to obtain a feature vector of the audio data; finally, optimization and dimension reduction are performed on the feature vector by using the speaker recognition model to obtain an identity vector of the speaker to be recognized, the identity vector being used for distinguishing the identity of the speaker to be identified. In this way, framing and feature extraction of the original audio data are completed by the pre-trained speaker recognition model, and the extracted feature vector is further optimized and reduced in dimension to obtain an identity vector representing the identity of the speaker to be recognized. Feature extraction from the audio data to be recognized and identification of the speaker's identity are thus realized in an end-to-end manner, which avoids the problem, found when the original audio data are processed with a short-time Fourier transform to obtain frequency-domain features, that prior information fixes the frequency resolution of different frequency bands and makes the recognition system less robust.
In some possible embodiments, the trained speaker recognition model includes a feature learning network and an optimized dimension reduction network, fig. 2 is a flowchart of a speaker recognition method provided in the embodiments of the present application, and as shown in fig. 2, the method at least includes the following steps:
step S210, acquiring audio data output by a speaker to be identified;
step S220, utilizing a feature learning network to frame the audio data to obtain at least two frames of audio;
here, the framing process is to segment the audio data according to a specified length (time period or number of samples) and structure the audio data into a smooth data structure.
The implementation manner of using the feature learning network to frame the audio data may be any manner possible in the related art, for example, by windowing and setting frame shift, which is not limited in the embodiment of the present application.
Step S230, extracting the features of each frame of audio by using a feature learning network according to a specific expansion scale to obtain a feature vector of the audio data;
here, the feature learning network is a network in a trained speaker recognition model, that is, in the embodiment of the present application, a neural network is used to learn features with speaker characterization capability, so that feature extraction is directly performed on original speech data in an end-to-end manner.
The specific expansion scale represents the step length adopted during feature extraction, and different expansion scales reflect different receptive fields. Generally, the larger the receptive field, the more abstract the extracted features. Layers with different expansion scales are arranged in the feature learning network, and the expansion scales of the layers are gradually increased along with the deepening of the layers.
In some possible embodiments, the above-mentioned process of feature extraction for each frame of the audio at a specific scale of expansion is implemented as follows: the feature learning network comprises at least a first sublayer and a second sublayer; performing feature extraction on each frame of audio by using the first sublayer according to a first expansion scale to obtain a first vector; performing feature extraction on each frame of audio by using the second sublayer according to a second expansion scale to obtain a second vector; and connecting the first vector and the second vector to obtain the characteristic vector of the audio data.
Here, the first sublayer and the second sublayer are hidden layers in the feature learning network, and the second expansion scale corresponding to the second sublayer is larger than the first expansion scale corresponding to the first sublayer. The feature vectors extracted from each sub-layer are spliced together according to the weight to form the feature vectors.
It should be noted that the hidden layer does not directly receive external signals, nor directly send signals to the outside. The hidden layer acts as a middle black box in the neural network and can be considered as a general term for many other different functional layers.
It should be noted that the selection of the hidden layer in the feature learning network may be flexibly selected, that is, an adjacent layer may be arranged between the first sublayer and the second sublayer, or at least one sublayer may be spaced between the first sublayer and the second sublayer. That is, the feature learning network includes several hidden layers, and the second expansion scale is larger than the first expansion scale.
In some possible embodiments, the expansion scale of adjacent sub-layers in the feature learning network increases by powers of 2, where N is an integer greater than or equal to 0. That is, assuming that the first sublayer and the second sublayer are adjacent sublayers, the first sublayer performs feature extraction on each frame of audio with a step size of 2^(N-1), and the second sublayer performs feature extraction on each frame of audio with a step size of 2^N.
For example, when N is 1, the first sublayer extracts features at time-domain sample points 1, 2, 3, 4, 5, 6, 7, 8, 9, etc. of each frame of audio with a step size of 1, while the second sublayer extracts features at time-domain sample points 1, 3, 5, 7, 9, etc. of each frame with a step size of 2. In this way the scale is enlarged layer by layer, the receptive field grows, the window length of each layer increases in turn and the frequency resolution decreases in turn, so that more information can be learned through multi-layer processing.
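For illustration, two adjacent sub-layers with expansion (dilation) scales 2^0 and 2^1 can be sketched as follows (channel sizes and kernel size are assumptions; "same" padding is used here only so the two outputs can be connected directly):

```python
# First and second sub-layers with growing dilation; their outputs are
# connected (concatenated) to form the feature vector of the frame.
import torch
import torch.nn as nn

first_sublayer = nn.Conv1d(1, 64, kernel_size=3, dilation=1, padding="same")    # scale 2^0
second_sublayer = nn.Conv1d(64, 64, kernel_size=3, dilation=2, padding="same")  # scale 2^1

frame = torch.randn(1, 1, 400)                  # one frame of raw time-domain samples
first_vector = first_sublayer(frame)
second_vector = second_sublayer(first_vector)   # larger receptive field
feature_vector = torch.cat([first_vector, second_vector], dim=1)
```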
And S240, carrying out optimized dimensionality reduction on the feature vector by using an optimized dimensionality reduction network to obtain the identity vector of the speaker to be identified.
Here, the optimized dimension reduction network at least comprises a frame-level layer, a pooling layer and a segment-level layer connected in sequence, wherein: the frame-level layer is used for optimizing the regularities between different frames and the interrelations between the features within a frame; the pooling layer is realized by two fully-connected layers and is used for mapping the output feature vectors of the frame-level layer to a vector of fixed dimensionality through attention pooling (attention pooling) or statistical pooling (statistics pooling); and the segment-level layer is used for passing the output of the pooling layer through N layers of neural networks and then to the normalization layer, so as to judge the identity of the speaker.
In one possible embodiment, the above-described optimized dimension reduction process may be implemented by the following process: optimizing the feature vectors by utilizing the frame level layer to obtain the feature vectors at the frame level; mapping the frame-level vectors to fixed-dimension feature vectors using the pooling layer; connecting the feature vectors of the fixed dimensionality by using the segment level layer to obtain segment level feature vectors; and taking the feature vector of the segment level as the identity vector of the speaker to be identified.
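An attention-pooling step of the kind described above can be sketched as follows (dimensions are illustrative assumptions):

```python
# Attention pooling: map a variable number of frame-level vectors to one
# fixed-dimension vector by a learned weighted average over frames.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)             # one attention score per frame

    def forward(self, frame_feats):                # (batch, frames, dim)
        weights = torch.softmax(self.score(frame_feats), dim=1)
        return (weights * frame_feats).sum(dim=1)  # (batch, dim), fixed dimension
```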
In the embodiment of the application, the layers with different expansion scales are arranged in the feature learning network in the speaker recognition model, the scale is expanded layer by layer, the receptive field is expanded, the window length of each layer is sequentially increased, and the frequency resolution is sequentially reduced, so that more information can be learned through multi-layer processing. Meanwhile, the feature vectors extracted by the feature learning network are subjected to dimension reduction processing through an optimized dimension reduction network in the speaker recognition model, and the identity vectors of the speakers to be recognized are determined for identity discrimination of the subsequent scoring model.
Fig. 3 is a schematic flowchart of a speaker recognition method according to an embodiment of the present application, and as shown in fig. 3, a training process of the speaker recognition model at least includes the following steps:
step S310, obtaining a voice corpus sample marked with the identity of a speaker;
here, supervised training is performed through annotated speech corpus samples. With supervised is meant a machine learning task that infers a function from a labeled training dataset.
Step S320, acquiring the constructed speaker recognition model;
here, the constructed speaker recognition model includes an initial feature learning network and an initial optimized dimension reduction network.
Step S330, model parameters of the trained feature extraction network are adopted, and weight parameters of the initial feature learning network are initialized;
here, the feature extraction network and the feature learning network in the speaker recognition model have the same network structure, and the model parameters of the feature extraction network are determined by performing self-supervision training in advance using an unidentified speech corpus. Therefore, after the training of the feature extraction network reaches the convergence state, the feature learning network in the constructed speaker recognition model can be directly initialized to train the speaker recognition model.
The pre-trained feature extraction network is used for initializing the feature learning network in the time domain speaker recognition system, so that the network can learn more robust speaker features.
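Because the two networks share the same structure, the initialization of step S330 amounts to copying parameters, as in the following sketch (the stand-in modules are assumptions for illustration):

```python
# Initialize the feature learning network with the pre-trained feature
# extraction network's parameters; identical structure means matching keys.
import torch.nn as nn

feature_extraction_net = nn.Sequential(nn.Conv1d(1, 64, 3), nn.ReLU())  # pre-trained in steps S410-S440
feature_learning_net = nn.Sequential(nn.Conv1d(1, 64, 3), nn.ReLU())    # part of the new recognition model

feature_learning_net.load_state_dict(feature_extraction_net.state_dict())
```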
Step S340, training the initial feature learning network and the initial optimized dimensionality reduction network by using the voice corpus sample to obtain the speaker recognition model.
The training process is an end-to-end mode, based on a supervised loss function, a voice corpus sample is used as input, firstly, a feature vector of the voice corpus sample is extracted through a feature learning network, then, the feature vector extracted by the feature learning network is used as the input of the optimized dimension reduction network, and finally, an identity vector representing the identity of the speaker is output through the optimized dimension reduction network.
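A minimal sketch of such an end-to-end supervised training loop (the cross-entropy loss, optimizer and hyper-parameters are assumptions for illustration; the data loader is assumed to yield (waveform, speaker_id) pairs):

```python
# Supervised end-to-end training: waveform in, speaker posterior out.
import torch
import torch.nn as nn

def train_speaker_model(model, data_loader, epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()                      # supervised loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for waveform, speaker_id in data_loader:
            logits = model(waveform)                       # feature learning + optimized dimension reduction
            loss = criterion(logits, speaker_id)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```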
Fig. 4 is a schematic flowchart of a speaker recognition method according to an embodiment of the present application, and as shown in fig. 4, the process of the pre-training feature extraction network may be implemented by the following steps:
step S410, obtaining at least two types of original audio samples without labels;
here, the training is performed in an unsupervised manner by means of unlabelled raw audio samples. The self-supervision means that the model directly learns by itself from the label-free data without marking the data.
Step S420, obtaining a pseudo audio sample corresponding to each type of the original audio sample;
here, the unlabeled original audio samples are first processed by a data enhancement module to perform reverberation, noise, speed change, masking, spectrum enhancement, etc. to generate corresponding pseudo audio samples.
The at least two types of unlabeled original audio samples comprise a first type of original audio output by the same speaker and a second type of original audio output by different speakers. In some possible implementations, data enhancement processing is performed on the first type of original audio to obtain pseudo audio data of the same speaker as the first type of original audio; in other possible implementations, data enhancement processing is performed on the second type of original audio to obtain pseudo audio data of a speaker different from the second type of original audio. In the actual training process, at least one of these implementations can be selected according to the service requirement; the embodiments of the present application do not limit this.
Step S430, performing self-supervision training on an initial feature extraction network based on contrast loss by using each type of the original audio samples and the pseudo audio samples;
here, the contrast loss (contrast loss) is a loss function, which can well express the matching degree of the pair samples, i.e., the original audio sample and the pseudo audio sample, and can also be well used for training the model for extracting the features.
It should be noted that the value of the contrast loss function depends on the Euclidean distance between the two sample features and on a set margin. For samples that are originally similar, the larger the Euclidean distance in feature space, the larger the loss (an increasing function), indicating that the current model performs poorly. For dissimilar samples, the loss value becomes larger as the Euclidean distance in feature space becomes smaller.
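A common form of such a contrast (contrastive) loss is sketched below (the margin value is an assumption; y = 1 marks a matched original/pseudo pair and y = 0 a mismatched pair):

```python
# Contrastive loss: similar pairs are penalized for being far apart,
# dissimilar pairs for being closer than the margin.
import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, y, margin=1.0):
    d = F.pairwise_distance(feat_a, feat_b)                       # Euclidean distance
    positive = y * d.pow(2)                                       # same class: grows with distance
    negative = (1 - y) * torch.clamp(margin - d, min=0).pow(2)    # different class: grows as distance shrinks
    return (positive + negative).mean()
```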
In pre-training, the generated pseudo audio samples are fed into the feature extraction network together with the original audio samples. Self-supervised training is performed based on the contrast loss, and the pre-training stops when the loss falls to a threshold. In this way, the recognition accuracy is improved in cases of large channel differences or unevenly distributed data.
Step S440, responding to the situation that the contrast loss is less than or equal to a specific threshold value, determining model parameters of the feature extraction network, and obtaining the trained feature extraction network.
Here, the training is repeated until the value of the contrast loss is less than or equal to a specific threshold, which indicates that the feature extraction network model has been trained.
The speaker identification method is described below with reference to an embodiment, but it should be noted that the embodiment is only for better describing the present application and is not to be construed as limiting the present application.
In order to fully utilize the learning and characterization capabilities of the neural network, the embodiment of the application provides a time-domain speaker recognition system, and the relation of time-domain waveform audio is learned through the neural network. Also, a pre-training approach is proposed to reduce data non-uniformity or channel variation.
In order to solve the problem that prior information is utilized to solidify frequency resolutions of different frequency bands in the current speaker system, an end-to-end mode is provided in the embodiment of the application to train the time domain speaker recognition system based on an X vector frame. Meanwhile, the embodiment of the application adopts a pre-training method to construct a large amount of pseudo data (equivalent to pseudo audio samples) for learning the influence of different channels, and then initializes the characteristic learning layer in the time domain speaker recognition system by using a pre-trained model, so that a neural network learns more robust speaker characteristics.
In order to fully utilize the learning and characterization capabilities of the neural network, the embodiment of the application provides a time-domain speaker recognition system, and the relation of time-domain waveform audio is learned through the neural network. The embodiment of the application provides a time domain speaker recognition system, the structure part of which mainly comprises two parts: a model training part and a back-end scoring part. The embodiment of the present application mainly explains the model training portion, and the back-end scoring portion is the same as the implementation in the related art, and will not be described in the embodiment of the present application.
Fig. 5A is a schematic diagram of a model structure of the speaker recognition method according to the embodiment of the present application, and it can be seen that the model structure mainly includes a feature learning layer 51 (equivalent to a feature learning network), a frame level layer 52, a pooling layer 53, and a segment level layer 54.
The feature learning layer 51: as shown in FIG. 5B, the feature learning layer 51 includes several hidden layers, and the embodiment of the present application proposes layers with different expansion scales. As the hidden layers deepen, the expansion scale gradually increases; for example, the expansion scales of hidden layer 1, hidden layer 2, ..., hidden layer n are 2^0, 2^1, 2^2, ..., 2^(n-1) in sequence, i.e., the deeper the layer, the larger its receptive field. Here, every layer other than the input layer and the output layer is called a hidden layer; the hidden layers can be selected flexibly, and the deeper the layer, the more abstract the extracted features. The feature vector 511 finally obtained by the feature learning layer 51 is formed by combining the hidden-layer outputs according to their connection weights.
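For illustration, the feature learning layer of FIG. 5B can be sketched as a stack of dilated 1-D convolutions whose dilation grows as powers of two, with the hidden-layer outputs combined by learned weights (channel sizes, kernel size and the number of hidden layers are assumptions):

```python
# Feature learning layer sketch: hidden layer i uses dilation 2**i, so the
# receptive field grows with depth; outputs are combined by learned weights.
import torch
import torch.nn as nn

class FeatureLearningLayer(nn.Module):
    def __init__(self, channels=64, num_hidden=4):
        super().__init__()
        self.hidden = nn.ModuleList(
            nn.Conv1d(1 if i == 0 else channels, channels,
                      kernel_size=3, dilation=2 ** i, padding="same")
            for i in range(num_hidden))
        self.weights = nn.Parameter(torch.ones(num_hidden))   # connection weights

    def forward(self, frame):                  # (batch, 1, samples) raw waveform frame
        outputs, h = [], frame
        for layer in self.hidden:
            h = torch.relu(layer(h))
            outputs.append(h)
        w = torch.softmax(self.weights, dim=0)
        return sum(w[i] * out for i, out in enumerate(outputs))   # combined feature vector
```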
Frame-level layer 52: this layer learns frame-level information and can adopt network structures such as a residual network (ResNet), a self-attention Transformer, a convolution-augmented Transformer (Conformer), a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
The pooling layer 53: the output of the frame-level layer is subjected to attention pooling (attention pooling) or statistical pooling (statistical pooling), and the vector of the frame-level layer is mapped to a vector with fixed dimensions.
Segment level layer 54: and the output of the pooling layer is sent to a normalization layer after passing through N layers of neural networks to carry out speaker identity discrimination.
Compared with the existing speaker recognition system, the newly proposed system does not need feature extraction, and can learn the features with speaker characterization ability by using a neural network during model training.
Aiming at the situations that a large number of channels are not matched, label data are seriously insufficient or are distributed unevenly in the conventional speaker system, the embodiment of the application provides a pre-training method based on data enhancement. As shown in fig. 5C, the pre-training method includes the following steps:
step S501, data enhancement is carried out on the original audio sample without the label, and a pseudo audio sample is generated.
The unlabeled original audio samples are first passed through a data enhancement module to generate corresponding pseudo audio samples. The data are then classified: an original audio sample and the pseudo audio samples generated from it are placed in the same class, while different original audio samples are regarded as different classes.
Data enhancement of a large number of unlabeled raw audio samples includes, but is not limited to: adding reverberation, noise, speed change, masking, frequency spectrum enhancement and the like to obtain a large number of pseudo audio samples.
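A few of these enhancements can be sketched as follows (the SNR, speed rate and mask length below are assumptions for illustration):

```python
# Generate pseudo audio from an unlabeled original sample by additive noise,
# speed change and time masking.
import numpy as np

def add_noise(wave, snr_db=15):
    noise = np.random.randn(len(wave))
    scale = np.sqrt(np.mean(wave ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return wave + scale * noise

def change_speed(wave, rate=1.1):
    positions = np.arange(0, len(wave), rate)               # resample positions
    return np.interp(positions, np.arange(len(wave)), wave)

def time_mask(wave, mask_len=1600):
    start = np.random.randint(0, max(1, len(wave) - mask_len))
    masked = wave.copy()
    masked[start:start + mask_len] = 0.0
    return masked
```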
In a model training data set, the distribution of the audio data is often not very uniform; this non-uniform distribution refers to the influence caused by differences in the acoustic paths (i.e., channels). For example, if the model training data set is mostly clean speech while the test speech comes from a noisy environment, the recognition model's ability to discriminate speakers may be poor. In model training, if the training data can cover more scenes, the robustness of the model is improved.
Step S502, pre-training the feature extraction network 55 by using the original audio sample and the pseudo audio sample.
Here, the network structure of the feature extraction network is the same as that of the feature learning layer 51 described above. The generated pseudo audio samples are fed into the feature extraction network together with the original audio samples. As shown in FIG. 5D, data enhancement is performed on the candidate audio 11 to obtain a positive pseudo audio 12 and a negative pseudo audio 13; then the candidate audio 11 and the positive pseudo audio 12 are input to the feature extraction network 55 for training, or the candidate audio 11 and the negative pseudo audio 13 are input to the feature extraction network 55 for training. Data non-uniformity and channel variation are thereby reduced by means of pre-training.
During the pre-training (pretrain), the self-supervised training is carried out based on the contrast loss, and the pre-training is stopped when the loss is reduced to a threshold value. The contrast loss is a loss function of the self-supervision pre-training, is used for representing the matching degree between paired samples, and can also be well used for training a model for extracting features.
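The pre-training loop can therefore be sketched as follows (the pair loader, stopping threshold and the use of the contrastive loss shown earlier are assumptions for illustration):

```python
# Self-supervised pre-training of the feature extraction network on
# (candidate, pseudo) pairs; label 1 = same class, 0 = different class.
import torch

def pretrain(feature_net, pair_loader, loss_fn, lr=1e-3, threshold=0.05):
    optimizer = torch.optim.Adam(feature_net.parameters(), lr=lr)
    for candidate, pseudo, label in pair_loader:
        loss = loss_fn(feature_net(candidate), feature_net(pseudo), label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() <= threshold:             # stop pre-training once loss is low enough
            break
    return feature_net
```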
In step S503, the weight parameters of the feature learning layer 51 are initialized using the model parameters of the feature extraction network 55.
Here, the model parameters of the pre-trained feature extraction network 55 are initialized to the weight parameters of the feature learning layer 51 in fig. 5A. Thereby performing subsequent model training of the speaker recognition system.
The time-domain speaker recognition system provided by the embodiment of the application can fully utilize the strong learning capacity of the neural network to learn the characteristics with more speaker characterization capacity; meanwhile, the embodiment of the application provides a pre-training method based on data enhancement, which is used for improving the identification accuracy rate under the condition of large channel difference or uneven data distribution.
Based on the foregoing embodiments, there is provided a speaker recognition device, where the speaker recognition device includes modules and units included in the modules, and can be implemented by a processor in an electronic device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the Processor may be a Central Processing Unit (CPU), a microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 6 is a schematic structural diagram of a speaker recognition apparatus according to an embodiment of the present application, and as shown in fig. 6, the speaker recognition apparatus includes a sample obtaining module 610, a feature extracting module 620, and an optimization dimension reducing module 630, where:
the sample acquiring module 610 is configured to acquire audio data output by a speaker to be identified;
the feature extraction module 620 is configured to perform framing and feature extraction on the audio data by using a trained speaker recognition model to obtain a feature vector of the audio data;
the dimension optimizing module 630 is configured to perform dimension optimizing on the feature vector by using the speaker recognition model to obtain an identity vector of the speaker to be recognized; and the identity vector is used for distinguishing the identity of the speaker to be identified.
In some possible embodiments, the trained speaker recognition model includes a feature learning network and an optimized dimension reduction network, and the feature extraction module includes a framing submodule and an extraction submodule, wherein: the framing submodule is used for framing the audio data by utilizing the feature learning network to obtain at least two frames of audio; the extraction submodule is used for extracting the characteristics of each frame of audio by utilizing the characteristic learning network according to a specific expansion scale to obtain the characteristic vector of the audio data; and the optimization dimension reduction module is also used for performing optimization dimension reduction on the feature vector by using the optimization dimension reduction network to obtain the identity vector of the speaker to be identified.
In some possible embodiments, the optimized dimension reduction network comprises a frame level layer, a pooling layer, and a segment level layer, and the optimized dimension reduction module comprises a frame processing unit, a pooling unit, a segment processing unit, and a feature determination unit, wherein: the frame processing unit is used for optimizing the feature vector by utilizing the frame level layer to obtain a frame level feature vector; the pooling unit is used for mapping the frame-level vector to a feature vector with a fixed dimension by utilizing the pooling layer; the segment processing unit is used for connecting the feature vectors of the fixed dimensionality by using the segment level layer to obtain segment level feature vectors; and the characteristic determining unit is used for taking the characteristic vector of the segment level as the identity vector of the speaker to be identified.
In some possible embodiments, the feature learning network comprises at least a first sub-layer and a second sub-layer; the extraction submodule comprises a first unit, a second unit and a splicing unit, wherein: the first unit is configured to perform feature extraction on each frame of the audio according to a first expansion scale by using the first sublayer to obtain a first vector; the second unit is configured to perform feature extraction on each frame of audio according to a second expansion scale by using the second sublayer to obtain a second vector; wherein the second expansion dimension is greater than the first expansion dimension; and the splicing unit is used for connecting the first vector and the second vector to obtain the characteristic vector of the audio data.
In some possible embodiments, the expansion scales of adjacent sub-layers in the feature learning network increase by powers of 2 (2^(N-1) and 2^N), where N is an integer greater than or equal to 0.
In some possible embodiments, the apparatus 600 further comprises a recognition model training module, including: the first obtaining sub-module is used for obtaining a voice corpus sample marked with the identity of a speaker; the second acquisition submodule is used for acquiring a built speaker recognition model, and the built speaker recognition model comprises an initial feature learning network and an initial optimized dimension reduction network; the initialization submodule is used for adopting the trained feature extraction network model parameters and initializing the weight parameters of the initial feature learning network; the training submodule is used for training the initial feature learning network and the initial optimized dimensionality reduction network by utilizing the voice corpus sample to obtain the speaker recognition model; and taking the feature vector extracted by the feature learning network as the input of the optimized dimension reduction network.
In some possible embodiments, the recognition model training module further comprises a feature learning sub-module comprising: the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring at least two types of original audio samples without labels; the second obtaining unit is used for obtaining a pseudo audio sample corresponding to each type of the original audio sample; the training unit is used for carrying out self-supervision training on an initial feature extraction network on the basis of contrast loss by utilizing each type of the original audio samples and the pseudo audio samples; and the parameter determining unit is used for determining model parameters of the feature extraction network under the condition that the contrast loss is less than or equal to a specific threshold value so as to obtain the trained feature extraction network.
In some possible embodiments, the at least two classes of unlabeled original audio samples include a first class of original audio output by the same speaker and a second class of original audio output by a different speaker; the second obtaining unit is further configured to perform data enhancement processing on the first type of original audio to obtain pseudo audio data of a same speaker as the first type of original audio; and/or performing data enhancement processing on the second type of original audio to obtain pseudo audio data of a speaker different from the second type of original audio.
Here, it should be noted that: the above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the speaker recognition method is implemented in the form of a software functional module and sold or used as a standalone product, the speaker recognition method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a smartphone with a camera, a tablet computer, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in any of the speaker recognition methods in the above embodiments.
Correspondingly, in an embodiment of the present application, a chip is further provided, where the chip includes a programmable logic circuit and/or program instructions, and when the chip runs, it is configured to implement the steps in any of the speaker recognition methods in the foregoing embodiments.
Correspondingly, in an embodiment of the present application, a computer program product is further provided, which is used to implement the steps in any of the speaker recognition methods in the foregoing embodiments when the computer program product is executed by a processor of an electronic device.
Based on the same technical concept, the embodiments of the present application provide an electronic device for implementing the speaker recognition method described in the above method embodiments. Fig. 7 is a hardware entity diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 7, the electronic device 700 includes a memory 710 and a processor 720; the memory 710 stores a computer program executable on the processor 720, and the processor 720, when executing the computer program, implements the steps in any of the speaker recognition methods of the embodiments of the present application.
The Memory 710 is configured to store instructions and applications executable by the processor 720, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 720 and modules in the electronic device, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).
The processor 720, when executing the program, performs the steps of any of the speaker recognition methods described above. The processor 720 generally controls the overall operation of the electronic device 700.
The Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It can be understood that the electronic component implementing the above processor function may also be of another type, which is not specifically limited in the embodiments of the present application.
The computer storage medium/Memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, a Compact Disc Read-Only Memory (CD-ROM), or the like; it may also be any of various electronic devices including one or any combination of the above memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The serial numbers of the above embodiments of the present application are merely for description and do not represent the relative merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist separately as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an electronic device to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method of speaker recognition, the method comprising:
acquiring audio data output by a speaker to be identified;
performing framing and feature extraction on the audio data by using a trained speaker recognition model to obtain a feature vector of the audio data;
optimizing and reducing the dimension of the feature vector by using the speaker recognition model to obtain an identity vector of the speaker to be identified, wherein the identity vector is used for distinguishing the identity of the speaker to be identified.
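For illustration only (not part of the claims), the following Python sketch shows how the identity vector obtained by the method of claim 1 could be used to distinguish a speaker's identity. The trained_model callable (assumed to perform framing, feature extraction and dimension reduction on raw audio), the enrolled dictionary of previously extracted identity vectors, and the cosine-similarity threshold are all assumptions made for the example.

import torch
import torch.nn.functional as F

def identify(trained_model, audio: torch.Tensor, enrolled: dict, threshold: float = 0.7):
    # audio: raw waveform of the speaker to be identified;
    # enrolled: mapping of speaker name -> previously extracted identity vector.
    with torch.no_grad():
        identity_vec = trained_model(audio.unsqueeze(0)).squeeze(0)
    scores = {name: F.cosine_similarity(identity_vec, vec, dim=0).item()
              for name, vec in enrolled.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None   # None: no enrolled speaker matches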
2. The method of claim 1, wherein the trained speaker recognition model comprises a feature learning network and an optimized dimension reduction network,
the performing framing and feature extraction on the audio data by using the trained speaker recognition model to obtain the feature vector of the audio data comprises: framing the audio data by using the feature learning network to obtain at least two frames of audio; and extracting the features of each frame of audio by using the feature learning network according to a specific expansion scale to obtain the feature vector of the audio data;
the optimizing and dimension reduction of the feature vector by using the speaker recognition model to obtain the identity vector of the speaker to be recognized comprises the following steps: and optimizing and reducing the dimension of the characteristic vector by using the optimized dimension reduction network to obtain the identity vector of the speaker to be identified.
3. The method of claim 2, wherein the optimized dimension reduction network comprises a frame level layer, a pooling layer, and a segment level layer,
the optimizing and dimension reducing of the feature vector by using the optimizing and dimension reducing network to obtain the identity vector of the speaker to be identified comprises the following steps:
optimizing the feature vector by utilizing the frame level layer to obtain a frame level feature vector;
mapping the frame-level feature vectors to fixed-dimension feature vectors by using the pooling layer;
connecting the feature vectors of the fixed dimensionality by using the segment level layer to obtain segment level feature vectors;
and taking the feature vector of the segment level as the identity vector of the speaker to be identified.
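For illustration only (not part of the claims), a minimal Python sketch of an optimized dimension reduction network with frame level, pooling and segment level layers as recited in claim 3; the layer sizes and the use of mean/standard-deviation statistics pooling are assumptions made for the example.

import torch
import torch.nn as nn

class OptimizedDimReductionNet(nn.Module):
    def __init__(self, in_dim=512, frame_dim=512, embed_dim=192):
        super().__init__()
        self.frame_level = nn.Sequential(nn.Linear(in_dim, frame_dim), nn.ReLU())
        self.segment_level = nn.Sequential(nn.Linear(2 * frame_dim, embed_dim), nn.ReLU(),
                                           nn.Linear(embed_dim, embed_dim))
        self.out_dim = embed_dim

    def forward(self, frame_feats):               # (batch, frames, in_dim)
        x = self.frame_level(frame_feats)         # optimize the frame-level feature vectors
        mean, std = x.mean(dim=1), x.std(dim=1)   # pooling to a fixed dimension
        pooled = torch.cat([mean, std], dim=1)    # (batch, 2 * frame_dim)
        return self.segment_level(pooled)         # segment-level identity vector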
4. The method of claim 2 or 3, wherein the feature learning network comprises at least a first sub-layer and a second sub-layer;
the extracting the features of each frame of audio by using the feature learning network according to a specific expansion scale to obtain the feature vector of the audio data comprises the following steps:
performing feature extraction on each frame of audio by using the first sublayer according to a first expansion scale to obtain a first vector;
performing feature extraction on each frame of audio by using the second sublayer according to a second expansion scale to obtain a second vector; wherein the second expansion scale is greater than the first expansion scale;
and connecting the first vector and the second vector to obtain the characteristic vector of the audio data.
5. The method of claim 4, wherein the expansion scales of adjacent sub-layers in the feature learning network increase as 2 to the power of N, wherein N is an integer greater than or equal to 0.
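For illustration only (not part of the claims), a Python sketch of a feature learning network whose sub-layers use expansion (dilation) scales growing as 2 to the power of N, as recited in claims 4 and 5; the channel counts, kernel size, and number of sub-layers are assumptions made for the example.

import torch
import torch.nn as nn

class FeatureLearningNet(nn.Module):
    def __init__(self, in_channels=1, channels=128, num_sublayers=4, kernel_size=3):
        super().__init__()
        self.sublayers = nn.ModuleList([
            nn.Conv1d(in_channels, channels, kernel_size,
                      dilation=2 ** n, padding=(kernel_size - 1) * (2 ** n) // 2)
            for n in range(num_sublayers)         # expansion scales 1, 2, 4, 8, ...
        ])

    def forward(self, frames):                    # (batch, in_channels, samples per frame)
        outputs = [torch.relu(layer(frames)) for layer in self.sublayers]
        return torch.cat(outputs, dim=1)          # concatenated multi-scale feature vector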
6. A method as claimed in any one of claims 1 to 3, wherein said speaker recognition model is trained by:
acquiring a voice corpus sample marked with the identity of a speaker;
acquiring a built speaker recognition model, wherein the built speaker recognition model comprises an initial characteristic learning network and an initial optimized dimension reduction network;
initializing the weight parameters of the initial feature learning network with model parameters of a trained feature extraction network;
training the initial feature learning network and the initial optimized dimensionality reduction network by using the voice corpus sample to obtain the speaker recognition model; and taking the feature vector extracted by the feature learning network as the input of the optimized dimension reduction network.
7. The method of claim 6, wherein the feature extraction network is pre-trained by:
obtaining at least two types of original audio samples without labels;
acquiring a pseudo audio sample corresponding to each type of original audio sample;
performing self-supervision training on an initial feature extraction network based on contrast loss by utilizing each type of the original audio samples and the pseudo audio samples;
and determining model parameters of the feature extraction network to obtain the trained feature extraction network under the condition that the contrast loss is less than or equal to a specific threshold value.
8. The method of claim 7, wherein the at least two classes of unlabeled original audio samples comprise a first class of original audio output by a same speaker and a second class of original audio output by a different speaker;
the obtaining of the pseudo audio sample corresponding to each type of the original audio sample includes:
performing data enhancement processing on the first type of original audio to obtain pseudo audio data of the same speaker as the first type of original audio; and/or,
and performing data enhancement processing on the second type of original audio to obtain pseudo audio data of a speaker different from the first type of original audio.
9. A speaker recognition apparatus, comprising a sample acquisition module, a feature extraction module, and an optimization dimension reduction module, wherein:
the sample acquisition module is used for acquiring audio data output by the speaker to be identified;
the feature extraction module is used for performing framing and feature extraction on the audio data by using the trained speaker recognition model to obtain a feature vector of the audio data;
the optimization dimension reduction module is used for performing optimization dimension reduction on the feature vector by using the speaker recognition model to obtain an identity vector of the speaker to be recognized; and the identity vector is used for distinguishing the identity of the speaker to be identified.
10. An electronic device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the program.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202110281559.6A 2021-03-16 2021-03-16 Speaker recognition method and device, equipment and storage medium Pending CN115083419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281559.6A CN115083419A (en) 2021-03-16 2021-03-16 Speaker recognition method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110281559.6A CN115083419A (en) 2021-03-16 2021-03-16 Speaker recognition method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115083419A true CN115083419A (en) 2022-09-20

Family

ID=83245762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281559.6A Pending CN115083419A (en) 2021-03-16 2021-03-16 Speaker recognition method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115083419A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612767A (en) * 2023-07-17 2023-08-18 国网山东省电力公司菏泽供电公司 Embedding enhancement-based ultrashort speaker confirmation method, device and medium
CN116612767B (en) * 2023-07-17 2023-10-13 国网山东省电力公司菏泽供电公司 Embedding enhancement-based ultrashort speaker confirmation method, device and medium

Similar Documents

Publication Publication Date Title
Purwins et al. Deep learning for audio signal processing
Bhangale et al. Survey of deep learning paradigms for speech processing
Pascual et al. Learning problem-agnostic speech representations from multiple self-supervised tasks
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN107731233B (en) Voiceprint recognition method based on RNN
US11205420B1 (en) Speech processing using a recurrent neural network
JP2019514045A (en) Speaker verification method and system
Räsänen A computational model of word segmentation from continuous speech using transitional probabilities of atomic acoustic events
Algayres et al. Evaluating the reliability of acoustic speech embeddings
US20150269940A1 (en) Pattern recognition device, pattern recognition method, and computer program product
Ram et al. Sparse subspace modeling for query by example spoken term detection
Lim et al. Emotion recognition by facial expression and voice: review and analysis
O’Shaughnessy Recognition and processing of speech signals using neural networks
Mishra et al. Gender differentiated convolutional neural networks for speech emotion recognition
Jakubec et al. Deep speaker embeddings for Speaker Verification: Review and experimental comparison
Soltau et al. Reducing the computational complexity for whole word models
Picheny et al. Trends and advances in speech recognition
Sen et al. A convolutional neural network based approach to recognize bangla spoken digits from speech signal
CN115083419A (en) Speaker recognition method and device, equipment and storage medium
Ram et al. Phonetic subspace features for improved query by example spoken term detection
Tabibian A survey on structured discriminative spoken keyword spotting
Fukuda et al. Generalized knowledge distillation from an ensemble of specialized teachers leveraging unsupervised neural clustering
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Xu et al. Improve Data Utilization with Two-stage Learning in CNN-LSTM-based Voice Activity Detection
Chen et al. Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination