CN115691510A - Voiceprint recognition method based on random shielding training and computer equipment - Google Patents

Voiceprint recognition method based on random shielding training and computer equipment

Info

Publication number
CN115691510A
Authority
CN
China
Prior art keywords
voice
feature
vector
feature vector
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211193071.9A
Other languages
Chinese (zh)
Inventor
陈玮
冯少辉
张建业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Iplus Teck Co ltd
Original Assignee
Beijing Iplus Teck Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Iplus Teck Co ltd filed Critical Beijing Iplus Teck Co ltd
Priority to CN202211193071.9A priority Critical patent/CN115691510A/en
Publication of CN115691510A publication Critical patent/CN115691510A/en
Pending legal-status Critical Current

Abstract

The invention relates to a voiceprint recognition method based on random masking training and a computer device, belonging to the technical field of speech recognition. The method comprises the following steps: registering a plurality of user voices through a pre-trained feature extraction model to obtain a user voice registry, where the feature extraction model is obtained by constructing lossy speech feature vectors with a random masking method and training on them; acquiring a voice to be recognized, and performing feature extraction on it through the feature extraction model to obtain a feature vector of the voice to be recognized; calculating cosine similarity values between the feature vector of the voice to be recognized and those of all registered voices in the user voice registry; and identifying the user of the voice to be recognized based on the cosine similarity values. The invention solves the problems that voiceprint recognition methods in the prior art cannot accurately recognize lossy speech and have poor robustness.

Description

Voiceprint recognition method based on random masking training and computer device
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a voiceprint recognition method based on random masking training and a computer device.
Background
Voiceprint recognition, also known as speaker recognition, identifies a speaker by extracting voiceprint features from the speaker's speech and building a model on them. Current voiceprint recognition research faces three main difficulties. First, the semantic information of an utterance and the personal information of the speaker's voice are mixed in a complex form within the speech signal, and separating the speaker's individual characteristics from the speech features is not easy, which makes voiceprint recognition difficult. Second, a speaker's voice characteristics are not static and fixed: they vary over time and are closely related to external factors such as the speaker's environment, emotion, health condition, and age, so even the same person uttering the same words produces different speech signals. Third, the performance of speaker recognition systems in practical applications still needs improvement: when sound is transmitted over a communication line, it is inevitably corrupted by line noise, and the noise introduced differs from line to line. Speaker recognition in noisy environments therefore faces the problem of how to effectively remove the interference of various additive and multiplicative noises, and how to improve the robustness of the system under heavy noise and even lossy speech.
Disclosure of Invention
In view of the foregoing analysis, the present invention aims to provide a voiceprint recognition method and a computer device based on random masking training, which solve the problems that prior-art voiceprint recognition methods cannot accurately recognize lossy speech and have poor robustness.
The purpose of the invention is mainly achieved by the following technical solutions:
In one aspect, the invention discloses a voiceprint recognition method based on random masking training, comprising the following steps:
registering a plurality of user voices through a pre-trained feature extraction model to obtain a user voice registry, where the feature extraction model is obtained by constructing lossy speech feature vectors with a random masking method and training;
acquiring a voice to be recognized, and performing feature extraction on the voice to be recognized through the feature extraction model to obtain a feature vector of the voice to be recognized;
calculating cosine similarity values between the feature vector of the voice to be recognized and those of all registered voices in the user voice registry; and identifying the user of the voice to be recognized based on the cosine similarity values.
Further, constructing lossy speech feature vectors by a random masking method and training to obtain the feature extraction model comprises the following steps:
constructing a random masking feature coding model based on the Wav2Vec2 model;
inputting voice sample data into the random masking feature coding model for feature extraction, random masking, and feature coding to obtain lossy speech feature vectors;
and performing mask filling and mask-vector prediction on the lossy speech feature vectors, followed by loss-driven iterative update training, to obtain the feature extraction model.
Further, the random masking feature coding model performs the following steps to obtain the lossy speech feature vectors:
performing feature extraction on input voice sample data to obtain speech feature vectors with a fixed dimension;
randomly masking the speech feature vectors according to a uniform distribution; recording the positions of the masked vectors and of the remaining vectors, and embedding position information into the remaining vectors to obtain randomly masked lossy voice data;
and performing inter-frame attention calculation on the randomly masked lossy voice data to obtain lossy speech feature vectors that carry both temporal and contextual relations.
Further, the feature extraction model is obtained by training through the following steps:
constructing a voiceprint recognition training dataset comprising voice sample data and labels indicating the person to whom each voice sample belongs;
constructing lossy speech feature vectors from the voice sample data using the random masking feature coding model;
performing vector filling and position-information embedding on the lossy speech feature vectors to obtain mask-filled feature vectors;
predicting the speech feature vectors corresponding to the randomly masked parts from the mask-filled feature vectors to obtain complete speech feature vectors;
performing average pooling and softmax classification on the complete speech feature vectors to obtain the label category distribution corresponding to the complete speech feature vectors;
iteratively updating the model through a loss function computed between the labels in the training dataset and the labels produced by the voiceprint recognition classifier, to obtain a trained random masking feature coding model; and turning off the random masking function of the trained model to obtain the feature extraction model.
Further, performing vector filling and position-information embedding on the lossy speech feature vectors comprises: assigning each masked vector a shared learnable vector or a fixed non-zero vector; and embedding the corresponding position information into the masked vectors and the remaining vectors to obtain the mask-filled feature vectors.
Further, the mask-filled feature vector is input into a decoder, and the complete speech feature vector is obtained by the following formula:

Z_out = f_transformer(f_bn(H̃ + P_emb))

where H̃ represents the speech feature vector after the filler vectors are added, P_emb represents the position-coding information, f_bn represents a normalization operation, and f_transformer represents the Transformer decoder.
Further, average pooling and softmax classification are performed on the complete speech feature vector to obtain the prediction vector:

ŷ = softmax(W_o · AvgPool(Z_out) + b_o)

The loss function is formulated as:

L(x, y) = −log P(y | x)

where W_o and b_o denote the parameters of the fully-connected output layer, x is the input speech sequence, and y is the speaker identity label.
Further, registering a plurality of user voices through a pre-trained feature extraction model to obtain a user voice registry, including: inputting the voice of the user to be registered into the feature extraction model to obtain a voice feature vector of the user to be registered, labeling the tag of the voice feature vector of the user to be registered based on the user ID, and storing the feature vector of the voice to be registered and the corresponding tag to obtain a user voice registry.
Further, the masking rate of the random masking is greater than 50%.
In another aspect, a computer device is also disclosed, comprising at least one processor, and at least one memory communicatively coupled to the processor;
the memory stores instructions executable by the processor for execution by the processor to implement the aforementioned voiceprint recognition method.
The invention can realize at least the following beneficial effects:
1. The invention extracts speech features with a powerful voiceprint recognition feature extraction model trained by a random masking method, and confirms the voiceprint with cosine similarity calculation, thereby achieving accurate recognition of lossy speech in different environments.
2. The feature extraction model adopted by the invention is trained with random masking operations at a high masking rate, so accurate feature extraction can be performed on discontinuous or even damaged speech to be recognized, yielding voiceprint features that characterize the speaker.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The drawings, in which like reference numerals refer to like parts throughout, are for the purpose of illustrating particular embodiments only and are not to be considered limiting of the invention.
Fig. 1 is a flowchart of a voiceprint recognition method according to an embodiment of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
This embodiment discloses a voiceprint recognition method based on random masking training, which, as shown in Fig. 1, comprises the following steps:
step S1: registering a plurality of user voices through a pre-trained feature extraction model to obtain a user voice registry;
specifically, registering a plurality of user voices through a pre-trained feature extraction model to obtain a user voice registry, including: and extracting the voice input feature extraction model of the user to be registered to obtain a voice feature vector of the user to be registered, labeling the tag of the voice feature vector of the user to be registered based on the user ID, and storing the feature vector of the voice to be registered and the corresponding tag to obtain a user voice registry.
The feature extraction model is obtained by constructing lossy speech feature vectors with a random masking method and training.
Specifically, constructing lossy speech feature vectors with a random masking method and training to obtain the feature extraction model includes the following steps:
constructing a random masking feature coding model based on the Wav2Vec2 model;
inputting voice sample data into the random masking feature coding model for feature extraction, random masking, and feature coding to obtain lossy speech feature vectors;
and performing mask filling and mask-vector prediction on the lossy speech feature vectors, followed by loss-driven iterative update training, to obtain the feature extraction model.
As a specific embodiment, the feature extraction model can be obtained by training through the following steps:
step S101: constructing a voiceprint recognition training data set, wherein the voiceprint recognition training data set comprises voice sample data and a label representing a person to which the voice data belongs;
the voice data is obtained by preprocessing the voice of the speaker without constraint conditions in the Chinese voice data set. The tags are obtained by classifying and labeling voices of different speakers, for example, a number is labeled to the voice of each speaker to indicate a tag category corresponding to each voice.
Specifically, constructing the voiceprint recognition training dataset comprises:
acquiring a dataset containing voice sample data of a plurality of speakers;
constructing voice data labels, where each label is the speaker ID corresponding to a voice sample;
preprocessing the voice sample data through channel splitting, voice cutting, and format unification to obtain single-channel, fixed-length voice data that forms the voiceprint recognition training dataset.
Preferably, the data sources used in this embodiment are the unconstrained Chinese speaker recognition datasets cn-celeb1 and cn-celeb2: cn-celeb1 contains recordings or interviews of 997 individuals, and cn-celeb2 contains recordings or interviews of 1996 individuals. Speakers with fewer than fifty speech segments are removed, and the remaining data in cn-celeb1 and cn-celeb2 are labeled according to the speakers' IDs.
Further, the data is preprocessed, and the steps are as follows:
Channel splitting: all voices with more than one channel are split into single-channel voices, and a silence-removal operation is performed independently on the data of each channel. Speech segments are detected with the WebRTC voice endpoint detection method: the input speech is first divided into 20 ms fragments, each fragment is then tested for silence, and silent fragments are deleted while the rest are kept.
Voice cutting: the speech is cut to a fixed length range according to min_token and max_token; voices shorter than min_token are discarded, and voices longer than max_token are truncated, where min_token is 56000, representing speech no shorter than 5 s, and max_token is 480000, representing speech no longer than 30 s.
Format unification: the cut voices are uniformly converted to a format with a 16000 Hz sampling rate and 16-bit sampling precision.
The preprocessed voice data and the corresponding labels form a voiceprint recognition dataset, and the training portion is used as the voiceprint recognition training dataset.
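As a rough sketch of the channel splitting and voice cutting just described (silence removal with WebRTC VAD is elided, and the function name is ours), assuming float sample arrays at 16 kHz:

```python
# Preprocessing sketch for Step S101: channel splitting plus length filtering.
import numpy as np

MIN_TOKEN, MAX_TOKEN = 56000, 480000  # sample-count bounds from the embodiment

def split_and_cut(waveform: np.ndarray) -> list[np.ndarray]:
    """waveform: shape (n,) for mono or (n, channels) for multi-channel audio."""
    if waveform.ndim == 1:
        channels = [waveform]
    else:  # channel splitting: each channel becomes an independent mono utterance
        channels = [waveform[:, c] for c in range(waveform.shape[1])]
    out = []
    for ch in channels:
        # (per-channel WebRTC VAD silence removal would run here)
        if len(ch) < MIN_TOKEN:      # discard utterances shorter than min_token
            continue
        out.append(ch[:MAX_TOKEN])   # truncate utterances longer than max_token
    return out
```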
Step S102: constructing lossy speech feature vectors from the voice sample data using the random masking feature coding model;
Preferably, the feature coding model of this embodiment uses the wav2vec2 model as the initial model structure, without changing the original architecture. Its input is the preprocessed speech sample data in the training dataset. The sampled raw speech is fed into the model, and a speech feature vector representation is obtained through the CNN feature extraction layers of wav2vec2. A random masking operation is then applied: part of the obtained speech feature vectors is masked, and the positions of both the masked vectors and the remaining vectors are recorded. This embodiment adopts high-proportion masking, with a masking probability greater than 50%. The remaining speech vectors are then arranged in order, position-vector information is embedded, and the result is fed into the Transformer structure of wav2vec2 to obtain the lossy speech vector representation output by the feature encoder.
More specifically, based on the feature encoder of the wav2vec2 model structure, the original speech data is sampled to obtain X = (x_0, x_1, ..., x_l) as input. The input passes through a 7-layer convolutional network in which the output of each layer serves as the input of the next; the strides of the layers are (5, 2, 2, 2, 2, 2, 2) and the convolution kernel widths are (10, 3, 3, 3, 3, 2, 2), so the overall downsampling factor is 320. Feature coding generates speech feature vectors with a fixed dimension of 512, giving the hidden-layer features C = (c_0, c_1, ..., c_L) of dimension (1, L, 512), where L = l/320. The 512-dimensional vectors are then mapped into a 768-dimensional space to obtain the speech feature vector representation S. The feature vectors S are randomly masked, position information is embedded into the remaining unmasked vectors, and the result enters 12 blocks for inter-frame attention calculation. Each block is a Transformer structure containing 768 hidden units and a self-attention network; the Transformer's strength at modeling temporal relations is fully exploited so that each speech-frame feature is fully context-coded, and the obtained speech feature vectors carry both temporal and contextual relations. The output features after the 12 Transformer layers are H = (h_0, h_1, ..., h_L) with dimension (1, L, 768). The vectors S and H can be expressed as:

S = f_bn(W · f_cnn(X) + b)

H = f_transformer(f_bn(S̃ + P_emb))

where f_cnn denotes the CNN feature extractor, W ∈ R^(p×q) denotes the 512-to-768-dimensional spatial mapping with p = 512 and q = 768, S̃ denotes the part of S that remains after masking, P_emb denotes the position information of S̃, f_bn denotes a normalization operation, and f_transformer denotes the Transformer encoder in wav2vec2.
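The following PyTorch sketch mirrors the feature extractor just described: seven convolutional layers whose strides multiply to 320, 512-dimensional features, and a 512-to-768 projection. The GELU activations and the exact LayerNorm placement are assumptions borrowed from the public wav2vec2 design rather than details stated in the patent.

```python
# Sketch of the wav2vec2-style CNN feature extractor producing S from raw audio.
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    def __init__(self, dim=512, proj_dim=768):
        super().__init__()
        strides = (5, 2, 2, 2, 2, 2, 2)        # product = 320, so L = l / 320
        kernels = (10, 3, 3, 3, 3, 2, 2)
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, dim, kernel_size=k, stride=s), nn.GELU()]
            in_ch = dim
        self.conv = nn.Sequential(*layers)
        self.norm = nn.LayerNorm(dim)          # plays the role of f_bn
        self.proj = nn.Linear(dim, proj_dim)   # W: 512-dim -> 768-dim mapping

    def forward(self, x):                      # x: (batch, num_samples)
        c = self.conv(x.unsqueeze(1))          # C: (batch, 512, L)
        c = c.transpose(1, 2)                  # (batch, L, 512)
        return self.proj(self.norm(c))         # S: (batch, L, 768)
```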
It should be noted that for the feature encoder, the random masking operation is performed only during training, and the random masking function must be turned off in practical application. Because the masked portion of the same utterance is random, one utterance can be turned into several utterances with different text information, which provides a data augmentation effect. In addition, because high-proportion masking is adopted, the segments on both sides of each masked vector are themselves masked with high probability, and the model must learn the speaker's voiceprint characteristics from discontinuous speech features, so the trained feature encoder is highly robust. Finally, the masked partial vectors are recovered by a decoder and iterative update training is performed, so that the model learns to recognize the full speech features from only partial speech features; this places stronger demands on the encoder's extraction ability, leads to a deeper understanding of the speech features, and yields a feature extraction model with strong feature extraction capability.
For the random masking operation, random sampling masking is performed under a uniform distribution. Using a random mask at a high masking rate (i.e., greater than 50%) prevents the task from being solved trivially by extrapolating from unmasked neighboring vectors. The uniform distribution prevents a potential center bias (i.e., a higher masking probability near the center of the speech segment). The highly sparse input also creates the conditions for training an efficient feature encoder: a very powerful encoder can be trained with little computation and memory, and training efficiency can be improved by a factor of 2 to 3.
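A minimal sketch of this uniform random masking, recording the positions of the masked and remaining vectors as the embodiment requires; the 65% rate is an arbitrary choice above the 50% threshold.

```python
# Uniform random masking of frame-level features at a high masking rate.
import torch

def random_mask(s: torch.Tensor, mask_rate: float = 0.65):
    """s: (L, D) frame features; returns kept frames plus position bookkeeping."""
    L = s.shape[0]
    num_keep = max(1, int(L * (1.0 - mask_rate)))
    perm = torch.randperm(L)                  # uniform sampling, no center bias
    keep_idx = perm[:num_keep].sort().values  # keep survivors in temporal order
    mask_idx = perm[num_keep:].sort().values  # recorded for later mask filling
    return s[keep_idx], keep_idx, mask_idx
```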
Step S103: performing vector filling and position-information embedding on the lossy speech feature vectors to obtain mask-filled feature vectors. This comprises: assigning each masked vector a shared learnable vector or a fixed non-zero vector; and embedding the corresponding position information into the masked vectors and the remaining vectors to obtain the mask-filled feature vectors.
Specifically, for the vector H output by the feature encoder, a unified learnable vector I is inserted at the masked positions according to the recorded position information, and position-vector encodings are added to both the masked and unmasked parts to obtain the feature representation Z.
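A sketch of this filling step; the learned absolute position embedding is our assumption, since the patent only requires that position encodings be added to both the masked and unmasked parts.

```python
# Mask filling (Step S103): one shared learnable vector I fills every masked
# slot, then position encodings are added to all slots to form Z.
import torch
import torch.nn as nn

class MaskFill(nn.Module):
    def __init__(self, dim=768, max_len=4096):
        super().__init__()
        self.fill = nn.Parameter(torch.zeros(dim))  # shared learnable vector I
        self.pos = nn.Embedding(max_len, dim)       # assumed learned positions

    def forward(self, h_kept, keep_idx, mask_idx):
        L = len(keep_idx) + len(mask_idx)
        z = self.fill.expand(L, -1).clone()         # filler vector everywhere
        z[keep_idx] = h_kept                        # restore unmasked outputs H
        positions = torch.arange(L, device=z.device)
        return z + self.pos(positions)              # feature representation Z
```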
Step S104: predicting the speech feature vectors corresponding to the randomly masked parts from the mask-filled feature vectors to obtain complete speech feature vectors;
specifically, the vector Z obtained after the mask filling is input into a transform decoder for decoding to obtain a final vector representation Z out
Figure BDA0003870241110000091
Figure BDA0003870241110000092
Wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003870241110000093
representing the complete speech feature vector after adding vector I to the vector H output by the encoder,
Figure BDA0003870241110000101
representing position-coded information, f bn Representing a normalization operation, f transformer Representing a transform decoder.
Preferably, when this embodiment uses the wav2vec2 model as the initial training model, the encoder and decoder of the model can be designed with an asymmetric structure. The main function of the decoder in this embodiment is to recover, as far as possible, the information the padded feature vectors carried before masking; its structure is the same as that of the Transformer encoder, and its input is the vector formed by the unmasked feature vectors, the filler vectors, and the position encodings, where the position encodings represent the positions of the feature vectors in the complete speech. Since the Transformer decoder is used only during training to perform the speech reconstruction task, while the final recognition task is completed by the wav2vec2 encoder, the decoder architecture can be designed independently of the encoder, and experiments can be run with a very small decoder that is narrower and shallower than the encoder. For example, while the encoder employs a 12-layer, 768-dimensional Transformer structure, the decoder may use 2 or 3 layers at a 384-dimensional scale, so the decoder's computation per token is less than 10% of the encoder's. With this asymmetric design, all tokens are processed by a lightweight decoder, greatly reducing training time.
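A sketch of such an asymmetric pairing with stock PyTorch Transformer blocks (the patent states the decoder has the same structure as a Transformer encoder); the attention head counts and the 768-to-384 bridging projection are our assumptions.

```python
# Asymmetric design: heavy 12-layer/768-dim encoder, light 2-layer/384-dim
# decoder that is used only for the reconstruction task during training.
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12)
bridge = nn.Linear(768, 384)   # assumed projection into the decoder width
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True),
    num_layers=2)              # discarded after training; encoder does recognition
```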
Step S105: performing average pooling and softmax classification on the complete speech feature vectors to obtain the label category distribution corresponding to the complete speech feature vectors;
the output layer of the method predicts the speaker label category distribution through an average pooling operation and a softmax layer, the output vector dimension is the label category quantity in a training data set, and a prediction vector is obtained through the softmax layer
Figure BDA0003870241110000102
Figure BDA0003870241110000103
It should be noted that in the pre-training of the original Wav2vec2 model, the model loss consists of both a contrastive loss and a diversity loss; in this embodiment, however, because a high masking rate is used, predicting the masked information would be too difficult for the model and it would fail to converge. Moreover, the final task of the invention is to obtain a speaker voice feature vector representation, which is a text-independent task. Therefore, in this embodiment, the complete speech feature vector obtained by decoding is average-pooled, converted by a fully-connected layer network into the label dimension of the task, and probability-normalized through an activation function; finally, the category corresponding to the highest-confidence value is selected to obtain the category information of the speech data.
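A sketch of this output layer: average pooling over the time axis, a fully-connected layer to the label dimension, and softmax normalization; the speaker count is illustrative.

```python
# Classification head for Step S105: AvgPool -> FC -> softmax over labels.
import torch
import torch.nn as nn

class SpeakerHead(nn.Module):
    def __init__(self, dim=768, num_speakers=1000):
        super().__init__()
        self.fc = nn.Linear(dim, num_speakers)    # maps features to label dim

    def forward(self, z_out):                     # z_out: (batch, L, dim)
        pooled = z_out.mean(dim=1)                # average pooling over frames
        return torch.softmax(self.fc(pooled), dim=-1)  # label distribution
```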
Step S106: iteratively updating the model through a loss function computed between the labels in the training dataset and the labels produced by the voiceprint recognition classifier to obtain a trained random masking feature coding model; and turning off the random masking function of the trained model to obtain the feature extraction model.
Specifically, this embodiment performs the loss-driven iterative update using the following cross-entropy loss function to obtain the trained feature coding model:

L(x, y) = −log P(y | x)

where x is the input speech sequence and y is the speaker identity label.
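A sketch of one loss-driven update step consistent with this formula; note that nn.CrossEntropyLoss expects pre-softmax logits, so the head's softmax is folded into the loss, and the model and head names reuse the earlier sketches.

```python
# One training iteration: forward, cross-entropy loss (-log P(y|x)), update.
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train_step(encoder_stack, head, optimizer, x, y):
    """x: (batch, num_samples) waveforms; y: (batch,) speaker label indices."""
    z_out = encoder_stack(x)                 # (batch, L, D) decoded features
    logits = head.fc(z_out.mean(dim=1))      # pooled logits, no softmax here
    loss = criterion(logits, y)              # cross-entropy with speaker labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```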
In practical application, the random masking function of the trained feature coding model is turned off to obtain the feature extraction model. The speech data to be recognized is input into the feature extraction model; through CNN feature extraction and Transformer coding, a feature vector carrying the speaker's voiceprint features is obtained, and the speaker's identity is recognized and confirmed based on this feature vector.
Step S2: acquiring the voice to be recognized, and performing feature extraction on it through the feature extraction model to obtain the feature vector of the voice to be recognized.
Specifically, this embodiment adopts an existing speech acquisition method, and the acquired speech is preprocessed and then input into the trained feature extraction model for feature extraction. The feature extraction model obtained by the random masking training method has strong feature extraction capability: it can extract and encode the input speech into a feature vector carrying the user's voiceprint information, which is subsequently used for cosine similarity calculation to confirm the user's identity.
Step S3: calculating cosine similarity values between the feature vector of the voice to be recognized and those of all registered voices in the user voice registry; and identifying the user of the voice to be recognized based on the cosine similarity values.
Specifically, the similarity values between the feature vector of the voice to be recognized and the voice feature vectors in the voice registry are calculated, and the user to whom the registered voice feature vector with the highest similarity value above a threshold belongs is taken as the user of the voice to be recognized.
In this embodiment, the speech to be recognized is assumed to come from a user in the voice registry. For the speech of an unregistered speaker, after feature extraction by the feature extraction model, the similarity scores against all voice feature vectors in the registry fall below the threshold, and the system reports that no user matches.
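A sketch of this matching rule; the threshold value is illustrative, and the registry embeddings are assumed L2-normalized as in the enrollment sketch above.

```python
# Step S3 matching: best cosine similarity wins if it clears the threshold.
import torch
import torch.nn.functional as F

def identify(query_emb: torch.Tensor, registry: dict, threshold: float = 0.7):
    """registry: user ID -> (D,) embedding; returns (user or None, best score)."""
    best_user, best_score = None, -1.0
    for user_id, ref_emb in registry.items():
        score = F.cosine_similarity(query_emb, ref_emb, dim=0).item()
        if score > best_score:
            best_user, best_score = user_id, score
    if best_score >= threshold:
        return best_user, best_score
    return None, best_score      # unregistered speaker: report no matching user
```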
In another embodiment of the present invention, a computer device is also disclosed, the computer device comprising at least one processor, and at least one memory communicatively coupled to the processor;
the memory stores instructions executable by the processor, where the instructions are for execution by the processor to implement the voiceprint recognition method described above.
In summary, the invention adopts a voiceprint recognition method trained with the random masking method, and can accurately recognize discontinuous or lossy speech in various noise environments. During training, high-density random masking and feature extraction are applied to the input speech, lossy speech feature vectors are constructed for training, and the speech features of the randomly masked parts are reconstructed, realizing a content-independent voiceprint recognition feature extraction model and voiceprint recognition method. The robustness of voiceprint recognition is improved, the method is applicable to a variety of voiceprint recognition scenarios, and lossy speech can be recognized accurately.
Those skilled in the art will appreciate that all or part of the processes for implementing the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware, the program being stored in a computer-readable storage medium such as a magnetic disk, an optical disc, a read-only memory, or a random-access memory.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A voiceprint recognition method based on random masking training, characterized by comprising the following steps:
registering a plurality of user voices through a pre-trained feature extraction model to obtain a user voice registry, where the feature extraction model is obtained by constructing lossy speech feature vectors with a random masking method and training;
acquiring a voice to be recognized, and performing feature extraction on the voice to be recognized through the feature extraction model to obtain a feature vector of the voice to be recognized;
calculating cosine similarity values between the feature vector of the voice to be recognized and those of all registered voices in the user voice registry; and identifying the user of the voice to be recognized based on the cosine similarity values.
2. The voiceprint recognition method according to claim 1, wherein constructing lossy speech feature vectors by a random masking method and training to obtain the feature extraction model comprises the following steps:
constructing a random masking feature coding model based on the Wav2Vec2 model;
inputting voice sample data into the random masking feature coding model for feature extraction, random masking, and feature coding to obtain lossy speech feature vectors;
and performing mask filling and mask-vector prediction on the lossy speech feature vectors, followed by loss-driven iterative update training, to obtain the feature extraction model.
3. The voiceprint recognition method of claim 2, wherein the random masking feature coding model performs the following steps to obtain the lossy speech feature vectors:
performing feature extraction on input voice sample data to obtain speech feature vectors with a fixed dimension;
randomly masking the speech feature vectors according to a uniform distribution; recording the positions of the masked vectors and of the remaining vectors, and embedding position information into the remaining vectors to obtain randomly masked lossy voice data;
and performing inter-frame attention calculation on the randomly masked lossy voice data to obtain lossy speech feature vectors that carry both temporal and contextual relations.
4. The voiceprint recognition method according to claim 2, wherein the feature extraction model is obtained by training through the following steps:
constructing a voiceprint recognition training dataset comprising voice sample data and labels indicating the person to whom each voice sample belongs;
constructing lossy speech feature vectors from the voice sample data using the random masking feature coding model;
performing vector filling and position-information embedding on the lossy speech feature vectors to obtain mask-filled feature vectors;
predicting the speech feature vectors corresponding to the randomly masked parts from the mask-filled feature vectors to obtain complete speech feature vectors;
performing average pooling and softmax classification on the complete speech feature vectors to obtain the label category distribution corresponding to the complete speech feature vectors;
iteratively updating the model through a loss function computed between the labels in the training dataset and the labels produced by the voiceprint recognition classifier to obtain a trained random masking feature coding model; and turning off the random masking function of the trained model to obtain the feature extraction model.
5. The voiceprint recognition method according to claim 4, wherein performing vector filling and position-information embedding on the lossy speech feature vectors comprises: assigning each masked vector a shared learnable vector or a fixed non-zero vector; and embedding the corresponding position information into the masked vectors and the remaining vectors to obtain the mask-filled feature vectors.
6. The voiceprint recognition method according to claim 4, wherein the mask-filled feature vector is input into a decoder, and the complete speech feature vector is obtained by the following formula:

Z_out = f_transformer(f_bn(H̃ + P_emb))

where H̃ represents the speech feature vector after the filler vectors are added, P_emb represents the position-coding information, f_bn represents a normalization operation, and f_transformer represents the Transformer decoder.
7. The method of claim 4, wherein average pooling and softmax classification are performed on the complete speech feature vector to obtain the prediction vector:

ŷ = softmax(W_o · AvgPool(Z_out) + b_o)

and the loss function is formulated as:

L(x, y) = −log P(y | x)

where W_o and b_o denote the parameters of the fully-connected output layer, x is the input speech sequence, and y is the speaker identity label.
8. The voiceprint recognition method according to claim 1, wherein registering a plurality of user voices through a pre-trained feature extraction model to obtain a user voice registry comprises: inputting the voice of the user to be registered into the feature extraction model to obtain a voice feature vector of the user to be registered, labeling the tag of the voice feature vector of the user to be registered based on the user ID, and storing the feature vector of the voice to be registered and the corresponding tag to obtain a user voice registry.
9. The voiceprint recognition method of claim 3, wherein the masking rate of the random mask is greater than 50%.
10. A computer device comprising at least one processor and at least one memory communicatively coupled to the processor;
the memory stores instructions executable by the processor for execution by the processor to implement the voiceprint recognition method of any one of claims 1 to 9.
CN202211193071.9A 2022-09-28 2022-09-28 Voiceprint recognition method based on random shielding training and computer equipment Pending CN115691510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211193071.9A CN115691510A (en) 2022-09-28 2022-09-28 Voiceprint recognition method based on random shielding training and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211193071.9A CN115691510A (en) 2022-09-28 2022-09-28 Voiceprint recognition method based on random shielding training and computer equipment

Publications (1)

Publication Number Publication Date
CN115691510A true CN115691510A (en) 2023-02-03

Family

ID=85065434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211193071.9A Pending CN115691510A (en) 2022-09-28 2022-09-28 Voiceprint recognition method based on random shielding training and computer equipment

Country Status (1)

Country Link
CN (1) CN115691510A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115954007A (en) * 2023-03-14 2023-04-11 北京远鉴信息技术有限公司 Voiceprint detection method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination