CN114913860A - Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product - Google Patents

Info

Publication number
CN114913860A
CN114913860A (application CN202210450804.6A)
Authority
CN
China
Prior art keywords
voiceprint recognition
network
initial
data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210450804.6A
Other languages
Chinese (zh)
Inventor
黄淋
饶宇熹
宁博
黎明欣
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210450804.6A
Publication of CN114913860A
Legal status: Pending

Classifications

    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
        • G10L17/00 Speaker identification or verification techniques
            • G10L17/04 Training, enrolment or model building
            • G10L17/06 Decision making techniques; pattern matching strategies
            • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
            • G10L17/18 Artificial neural networks; connectionist approaches
        • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 characterised by the type of extracted parameters
                • G10L25/24 the extracted parameters being the cepstrum
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
                • G06N3/044 Recurrent networks, e.g. Hopfield networks
                • G06N3/045 Combinations of networks
                • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a voiceprint recognition method, a voiceprint recognition apparatus, a computer device, a storage medium, and a program product. The method comprises: acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result. With this method, a pre-trained voiceprint recognition model recognizes the voice data to be recognized. Because the voiceprint recognition model to be trained reuses the convolutional-layer parameters of a generative adversarial network (GAN) model that was trained after data augmentation of a small-sample voice data training set, training the voiceprint recognition model draws on knowledge learned from a large, augmented voice data set; this accelerates convergence of the training and improves the recognition accuracy of the voiceprint recognition model.

Description

Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product
Technical Field
The present application relates to the field of computer technologies, and in particular, to a voiceprint recognition method, apparatus, computer device, storage medium, and program product.
Background
Voiceprint recognition determines a speaker's identity by extracting voice features from the speaker's speech. Because voiceprints are convenient to collect, well accepted by users, low-cost, and contactless, voiceprint recognition is well suited to remote-identification scenarios and is widely used in banking, securities, and other fields.
Because deep learning offers strong nonlinear expressiveness and automatic feature learning, it has become the mainstream approach for extracting deep voice features; in the related art, a deep-learning-based voiceprint recognition model extracts deep features from voice data to perform voiceprint recognition. In practice, however, voice data are scarce, and it is difficult to collect enough of them to train such a model; a voiceprint recognition model trained on a small-sample voice data set therefore suffers from low recognition accuracy.
Disclosure of Invention
In view of the above, there is a need for a voiceprint recognition method, apparatus, computer device, storage medium, and program product that solve the above technical problem.
In a first aspect, a voiceprint recognition method includes:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
wherein the initial parameters of the convolutional layer used when training the voiceprint recognition model are determined from the convolutional-layer parameters of a pre-trained generative adversarial network (GAN) model, the GAN model having been trained after data augmentation of a small-sample voice data training set.
In one embodiment, the process of constructing the voiceprint recognition model comprises the following steps:
training an initial generative adversarial network (GAN) model on a small-sample voice data training set to obtain a trained GAN model;
migrating network parameters of the trained GAN model into an initial voiceprint recognition model, and training the initial voiceprint recognition model on the small-sample voice data training set to obtain the voiceprint recognition model; the convolutional layers of the initial voiceprint recognition model and of the GAN model have the same structure.
In one embodiment, the initial GAN model comprises an initial generator network and an initial discriminator network;
training the initial GAN model on the small-sample voice data training set to obtain the trained GAN model then includes:
preprocessing the small-sample voice data training set to obtain a preprocessing result;
and inputting random noise data into the initial generator network to obtain generated data, inputting the preprocessing result and the generated data into the initial discriminator network, and jointly training the initial generator network and the initial discriminator network to obtain a generator network and a discriminator network.
In one embodiment, inputting the preprocessing result and the generated data into the initial discriminator network and jointly training the initial generator network and the initial discriminator network to obtain the generator network and the discriminator network includes:
inputting the preprocessing result and the generated data into the initial discriminator network to obtain an initial discrimination prediction;
calculating, with a loss function, a prediction error value between the initial discrimination prediction and the standard discrimination result;
updating the network parameters of the initial generator network and the initial discriminator network according to the prediction error value;
and, if the prediction error value satisfies a preset convergence condition, determining that training of the initial generator network and the initial discriminator network is finished, yielding the generator network and the discriminator network.
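The joint-training steps above can be sketched as a minimal loop: compute the prediction error, update both networks from it, and stop once the preset convergence condition holds. The update functions below are placeholders standing in for the real generator and discriminator parameter updates, and the tolerance and iteration cap are illustrative assumptions, not values fixed by the application:

```python
def joint_train(update_generator, update_discriminator, prediction_error,
                max_iters=1000, tol=1e-3):
    """Jointly train both networks: compute the prediction error value,
    update the parameters of both networks from it, and stop once the
    error satisfies the preset convergence condition."""
    err = float("inf")
    for _ in range(max_iters):
        err = prediction_error()       # loss between prediction and standard result
        if err < tol:                  # preset convergence condition
            break
        update_generator(err)          # update initial generator parameters
        update_discriminator(err)      # update initial discriminator parameters
    return err

# Toy usage: a stand-in "error" that halves on every generator update.
state = {"err": 1.0}
final = joint_train(lambda e: state.update(err=state["err"] * 0.5),
                    lambda e: None,
                    lambda: state["err"])
```

In the real setting `prediction_error` would evaluate the loss of the discriminator on the preprocessing result and the generated data, and the two update functions would apply gradient steps to each network.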
In one embodiment, preprocessing the small-sample voice data training set to obtain the preprocessing result includes:
framing the small-sample voice data in the small-sample voice data training set to obtain a plurality of voice frames;
windowing each voice frame to obtain corresponding windowed data;
applying a Fourier transform to each piece of windowed data to determine a two-dimensional spectrogram;
and mapping the frequency data in the two-dimensional spectrogram onto the mel scale to obtain mel-spectrogram data, which is determined as the preprocessing result.
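The four preprocessing steps above (framing, windowing, Fourier transform, mel mapping) can be sketched in plain NumPy. The sample rate, frame length, hop size, and number of mel bands below are illustrative assumptions, not values fixed by the application:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    # Step 1: framing -- split the waveform into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Step 2: windowing -- apply a Hann window to each frame.
    frames = frames * np.hanning(frame_len)
    # Step 3: Fourier transform -- magnitude spectra form a 2-D spectrogram.
    spec = np.abs(np.fft.rfft(frames, axis=1))     # shape (n_frames, frame_len//2 + 1)
    # Step 4: map linear frequency bins onto the mel scale with triangular filters.
    n_bins = spec.shape[1]
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bin_pts = np.round(hz_pts / (sr / 2) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return spec @ fbank.T                          # mel spectrogram, (n_frames, n_mels)
```

A log or power compression step is often applied afterwards in practice; the application only specifies the mel mapping itself.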
In one embodiment, the discriminator network in the generative adversarial network (GAN) model comprises a first convolutional layer and a first fully connected layer, and the initial voiceprint recognition model comprises a second convolutional layer and a second fully connected layer;
migrating the network parameters of the GAN model into the initial voiceprint recognition model and training the initial voiceprint recognition model on the small-sample voice data training set then includes:
determining the network parameters of the first convolutional layer in the discriminator network as the network parameters of the second convolutional layer in the initial voiceprint recognition model, and initializing the network parameters of the second fully connected layer;
and training the initial voiceprint recognition model on the small-sample voice data training set.
In a second aspect, a voiceprint recognition apparatus comprises:
a voice data acquisition module, configured to acquire the voice data to be recognized;
and a voiceprint recognition module, configured to input the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
wherein the initial parameters of the convolutional layer used when training the voiceprint recognition model are determined from the convolutional-layer parameters of a pre-trained generative adversarial network (GAN) model, the GAN model having been trained after data augmentation of a small-sample voice data training set.
In a third aspect, a computer device comprises a memory and a processor, the memory storing a computer program, and the processor implementing the steps of any of the above methods in the first aspect when executing the computer program.
In a fourth aspect, a readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods in the first aspect described above.
In a fifth aspect, a computer program product comprises a computer program which, when executed by a processor, implements the steps of any of the methods in the first aspect described above.
According to the voiceprint recognition method, apparatus, computer device, storage medium, and program product above, the computer device can acquire the voice data to be recognized and input it into a preset voiceprint recognition model to obtain a voiceprint recognition result. The pre-trained voiceprint recognition model recognizes the voice data to be recognized, and because the model to be trained reuses the convolutional-layer parameters of a generative adversarial network (GAN) model trained after data augmentation of the small-sample voice data training set, its training draws on knowledge learned from a large, augmented voice data set; this accelerates convergence of the voiceprint recognition model's training and improves its recognition accuracy.
Drawings
FIG. 1 is a diagram of an application environment of a voiceprint recognition method in one embodiment;
FIG. 2 is a flow diagram illustrating a voiceprint recognition method in one embodiment;
FIG. 3 is a flow diagram illustrating a method for constructing a voiceprint recognition model in one embodiment;
FIG. 4 is a schematic flow chart of a method for training an initial generative adversarial network (GAN) model on a small-sample voice data training set to obtain a trained GAN model in another embodiment;
FIG. 5 is a flow chart of a method for jointly training an initial generator network and an initial discriminator network in another embodiment;
FIG. 6 is a flowchart of a method for preprocessing a small-sample voice data training set to obtain a preprocessing result in another embodiment;
FIG. 7 is a flowchart of a method for migrating network parameters of the GAN model into an initial voiceprint recognition model and training the initial voiceprint recognition model in another embodiment;
FIG. 8 is a block diagram of the structure of a voiceprint recognition apparatus in one embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voiceprint recognition method provided by this application can be applied to the voiceprint recognition system shown in fig. 1, which comprises a voice acquisition device and a computer device. Optionally, the voice acquisition device may be a recording pen, a sound collector, a voice detector, or the like; the computer device may be implemented as an independent server or a server cluster composed of multiple servers, or may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The voice acquisition device and the computer device can communicate with each other, for example over Wi-Fi, a mobile network, or Bluetooth. The following embodiments explain how the voiceprint recognition method is implemented.
To extract the voice features of a speaking subject and thereby determine its identity, an embodiment of the present application provides a voiceprint recognition method. Taking its application to the computer device in fig. 1 as an example, as shown in fig. 2, the voiceprint recognition method includes the following steps:
S100, acquiring the voice data to be recognized.
Specifically, the voice acquisition device may collect the voice data of at least one speaking subject at regular intervals, and the computer device may receive the voice data sent by the voice acquisition device in real time; this is the voice data to be recognized. Alternatively, the voice acquisition device may store the collected voice data locally or in the cloud, and in actual application the computer device may fetch the pre-stored voice data from either location. The computer device may also download simulated voice data from the Internet as the voice data to be recognized.
It should be noted that the voice data to be recognized may contain only the speaking subject. Alternatively, the computer device may acquire a voice data set to be recognized and perform feature extraction on it to isolate the voice data that contains only the speaking subject, i.e., the voice data to be recognized. Optionally, such a data set may include voice data of the speaking subject as well as of non-speaking sources; in this embodiment, the non-speaking sources may be environmental sounds, such as the sounds of electronic devices or of nature (wind, thunder, running water), and so on.
S200, inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized. The initial parameters of the convolutional layer used when training the voiceprint recognition model are determined from the convolutional-layer parameters of a pre-trained generative adversarial network (GAN) model, and the GAN model is trained after data augmentation of a small-sample voice data training set.
Specifically, the preset voiceprint recognition model may be a pre-trained voiceprint recognition model. Optionally, it can be a deep learning network model, for example at least one of a convolutional neural network, a recurrent neural network, or a deep belief network.
It should be noted that the voiceprint recognition model may include at least one of a convolutional layer, a fully connected layer, and a pooling layer, in any number; this is not limited here. In this embodiment, however, the voiceprint recognition model includes at least one convolutional layer. Optionally, before training, the initial (network) parameters of that convolutional layer may be the pre-trained parameters of the convolutional layer in the GAN model.
Optionally, data augmentation may be understood as cropping, shifting, changing the brightness of, adding noise to, rotating, and/or mirroring the voice data in the small-sample voice data training set. The voiceprint recognition result may be the identity of the speaking subject, established either by speaker identification or by speaker verification. Speaker identification can be understood as a 1:N comparison of the speaking subject's voice data against the voice data of N speakers in a preset database, finding the matching entry and taking the corresponding speaker as the speaking subject. Speaker verification can be understood as a 1:1 comparison of the speaking subject's voice data against that subject's voice data in a preset database, to confirm the claimed identity.
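The 1:N identification and 1:1 verification just described can be sketched over fixed-length voiceprint embeddings. The cosine-similarity scoring, the acceptance threshold, and the toy 2-D embeddings below are illustrative assumptions; the application does not prescribe a particular comparison metric:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two voiceprint embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, enrolled):
    """1:N speaker identification: compare the probe embedding against every
    enrolled speaker in the database and return the best-matching identity."""
    return max(enrolled, key=lambda name: cosine(probe, enrolled[name]))

def verify(probe, reference, threshold=0.7):
    """1:1 speaker verification: accept the claimed identity only if the
    similarity to that speaker's reference embedding exceeds the threshold."""
    return cosine(probe, reference) >= threshold

# Toy database of two enrolled speakers (hypothetical 2-D embeddings).
enrolled = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
probe = np.array([0.9, 0.1])
```

In practice the embeddings would be produced by the voiceprint recognition model itself, and the threshold would be tuned on held-out verification trials.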
With this voiceprint recognition method, the computer device can acquire the voice data to be recognized and input it into a preset voiceprint recognition model to obtain a voiceprint recognition result. The pre-trained voiceprint recognition model recognizes the voice data to be recognized, and because the model to be trained reuses the convolutional-layer parameters of a generative adversarial network model trained after data augmentation of the small-sample voice data training set, its training draws on knowledge learned from a large, augmented voice data set; this accelerates convergence of the voiceprint recognition model's training and improves its recognition accuracy.
Training a voiceprint recognition model on a small-sample voice data training set alone generally causes overfitting, poor generalization ability, and low voiceprint recognition accuracy. In one embodiment, therefore, as shown in fig. 3, the voiceprint recognition model can be constructed through the following steps:
S210, training the initial generative adversarial network (GAN) model on the small-sample voice data training set to obtain the trained GAN model.
Specifically, the small-sample voice data training set may be a combined set of voice data from multiple speaking subjects. Optionally, the GAN model may include a generative model and a discriminative model, and each of the two may include at least one of a convolutional layer, a fully connected layer, and a pooling layer. The overall structures of the generative model and the discriminative model may be the same or different, but their network parameters differ. In this embodiment, layers at the same position in different models may or may not have the same structure.
S220, migrating the network parameters of the trained GAN model into an initial voiceprint recognition model, and training the initial voiceprint recognition model on the small-sample voice data training set to obtain the voiceprint recognition model; the convolutional layers of the initial voiceprint recognition model and of the GAN model have the same structure.
Specifically, migrating the network parameters of the GAN model into the initial voiceprint recognition model can be understood as taking the trained network parameters of the generative model and/or the discriminative model as some or all of the network parameters of the initial voiceprint recognition model. The computer device then trains the initial voiceprint recognition model on the small-sample voice data training set.
The small-sample voice data training set used to train the initial voiceprint recognition model may differ from the one used to train the GAN model; in this embodiment, however, the two training sets are the same.
In this embodiment, the initial voiceprint recognition model and the GAN model may each include a convolutional layer, and to enable the convolutional-layer parameters to be migrated, those convolutional layers may have the same structure. Apart from the convolutional layer, the network structures of the two models may be the same or different.
With this construction, the convolutional-layer parameters of a GAN model trained after data augmentation of the small-sample voice data training set are migrated into the initial voiceprint recognition model, which is then trained on the small-sample voice data training set; the initial model thus draws on knowledge learned from a large, augmented voice data set, which accelerates convergence of the training and improves recognition accuracy. In addition, because the GAN model is used to augment the small-sample voice data set, the gap between the augmented set and the small-sample set is reduced, and transferring the convolutional-layer parameters trained on the augmented samples into the voiceprint recognition model for small-sample training improves the generalization ability of a model trained on small samples.
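The parameter migration of S220 can be sketched as copying the trained convolutional-layer parameters unchanged while re-initialising the classification head. The layer names, shapes, and random initialisation below are illustrative assumptions, not values fixed by the application:

```python
import numpy as np

rng = np.random.default_rng(0)

# Trained GAN discriminator parameters (layer names and shapes are illustrative).
discriminator = {
    "conv1.weight": rng.normal(size=(16, 1, 3, 3)),
    "conv2.weight": rng.normal(size=(32, 16, 3, 3)),
    "fc.weight": rng.normal(size=(1, 32)),        # real/fake output head
}

def migrate(disc_params, n_speakers, feat_dim=32):
    """Initialise the voiceprint model: reuse the discriminator's convolutional
    layers unchanged, re-initialise the fully connected classification layer."""
    model = {k: v.copy() for k, v in disc_params.items() if k.startswith("conv")}
    model["fc.weight"] = rng.normal(scale=0.01, size=(n_speakers, feat_dim))
    return model

model = migrate(discriminator, n_speakers=100)
```

The migrated model is then fine-tuned on the small-sample voice data training set; only the fully connected layer starts from scratch.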
The following describes how the initial GAN model is trained on the small-sample voice data training set so that the trained GAN model learns knowledge comparable to that of a large-sample voice data set. In one embodiment, the initial GAN model includes an initial generator network and an initial discriminator network; as shown in fig. 4, the training step of S210 can be implemented as follows:
S211, preprocessing the small-sample voice data training set to obtain a preprocessing result.
Specifically, the computer device may preprocess the voice data in the small-sample voice data training set, for example by removing noise (environmental noise, busy tones, ring-back tones, etc.), augmenting the data (overlaying echo, changing the playback rate, randomly masking in the time and frequency domains), clipping, converting the data, and/or extracting features, to obtain the preprocessing result.
S212, inputting random noise data into the initial generator network to obtain generated data, inputting the preprocessing result and the generated data into the initial discriminator network, and jointly training the initial generator network and the initial discriminator network to obtain a generator network and a discriminator network.
Specifically, the random noise may be Gaussian noise, single-frequency noise, impulse noise, fluctuation noise, white noise, and/or the like. During training of the initial GAN model, the computer device may first feed the generated random noise data to the initial generator network to obtain simulated data (i.e., the generated data), then feed the preprocessing result and the generated data to the initial discriminator network to jointly train the two networks; when both networks satisfy their respective convergence conditions, the current initial generator network is taken as the generator network and the current initial discriminator network as the discriminator network. Optionally, joint training may be understood as training the initial generator network and the initial discriminator network simultaneously.
It should be noted that the initial generator network may include a fully connected layer and several deconvolution (transposed-convolution) layers. The computer device can feed random noise data into the initial generator network, convert it into three-dimensional data through the fully connected layer, and then upsample that data through the deconvolution layers to obtain the generated data. Optionally, each deconvolution layer may output twice as much feature data as the preceding deconvolution layer.
It will be appreciated that the initial discriminator network may comprise several two-dimensional convolutional layers and a fully connected layer. The computer device can feed the preprocessing result and the generated data into the initial discriminator network, downsample them through the two-dimensional convolutional layers to learn deep voice features of the input, and then output the discrimination result of the initial discriminator network through the fully connected layer.
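The doubling behaviour of the generator's deconvolution layers can be checked with the standard transposed-convolution output-size formula. The kernel size 4, stride 2, padding 1, and the initial 4x4 spatial size are DCGAN-style assumptions, not values fixed by the application:

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution:
    out = (in - 1) * stride - 2 * padding + kernel.
    With kernel=4, stride=2, padding=1 each layer exactly doubles the input."""
    return (size - 1) * stride - 2 * padding + kernel

# Noise is first mapped by the fully connected layer to a small 3-D block
# (here assumed 4x4 spatially), then upsampled by three deconvolution layers.
size = 4
sizes = [size]
for _ in range(3):
    size = deconv_out(size)
    sizes.append(size)
# sizes is now [4, 8, 16, 32]: each deconvolution layer doubles the feature map.
```

The discriminator's strided two-dimensional convolutions follow the inverse formula, halving the spatial size at each downsampling layer.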
As shown in fig. 5, the step of inputting the preprocessing result and the generated data into the initial arbiter network in S212, and performing joint training on the initial generator network and the initial arbiter network to obtain the generator network and the arbiter network may be implemented by the following steps:
s2121, inputting the preprocessing result and the generated data into an initial discriminator network to obtain an initial discrimination prediction result.
Specifically, the computer device may input the pre-processing result or the generated data to the initial arbiter network to obtain the initial arbitration prediction result. However, in this embodiment, the computer device may simultaneously input the preprocessing result and the generated data output by the initial generator network to the initial arbiter network to obtain the initial discrimination prediction result.
It should be noted that, when the initial generator network and the initial discriminator network have not been trained, the initial discriminator network can correctly discriminate the authenticity of the preprocessing result relative to the generated data, and in this case, the initial discrimination prediction result may be the preprocessing result and the generated data each carrying an identifier. Optionally, the identifier may distinguish the authenticity of the preprocessing result and the generated data. The preprocessing result can be determined as real data; the generated data is simulated data and can be determined as pseudo data.
Optionally, when the training of the initial generator network and the initial discriminator network is finished, the initial discriminator network may erroneously judge the authenticity of the preprocessing result relative to the generated data, that is, in this case, the preprocessing result is determined as pseudo data and the generated data as real data. The initial discrimination prediction result may still be the preprocessing result and the generated data carrying identifiers, but these identifiers are opposite to those carried when the initial generator network and the initial discriminator network have not been trained.
And S2122, calculating a prediction error value between the initial judgment prediction result and the standard judgment result through a loss function.
Specifically, the above-mentioned loss function may be a 0-1 loss function, a squared loss function, an absolute value loss function, a logarithmic loss function, or the like. Optionally, the loss function includes a parameter corresponding to the initial discrimination prediction result and the standard discrimination result.
It should be noted that the computer device may substitute the initial discrimination prediction result into the loss function to obtain a prediction error value between the initial discrimination prediction result and the standard discrimination result. Optionally, the standard discrimination result may be the preprocessing result and the generated data, where the identifier carried in the preprocessing result is an identifier of pseudo data, and the identifier carried in the generated data is an identifier of real data. Optionally, the standard discrimination result may be understood as the gold standard for training the generated confrontation network model.
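As one concrete possibility for such a loss function (an assumption for illustration; the text does not fix a specific form beyond the families listed above), a binary cross-entropy between the discriminator's predictions and the standard discrimination labels could be computed as follows; the function name is hypothetical:

```python
import math

def bce_loss(predictions, labels):
    # Binary cross-entropy between discriminator outputs in (0, 1) and the
    # standard discrimination labels (1 = real, 0 = pseudo). A small epsilon
    # guards against log(0).
    eps = 1e-12
    total = 0.0
    for p, y in zip(predictions, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(predictions)

# An undecided discriminator (output 0.5) yields a loss of ln(2) per sample.
print(bce_loss([0.5, 0.5], [1, 0]))
```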
And S2123, updating network parameters in the initial generator network and the initial discriminator network according to the prediction error value.
Specifically, the computer device may adjust the network parameters in the initial generator network and the initial discriminator network based on the magnitude of the prediction error value. Optionally, if the prediction error value is larger, the adjustment value of the network parameter may be slightly larger, and if the prediction error value is smaller, the adjustment value of the network parameter may be slightly smaller.
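The error-scaled adjustment described above can be sketched as follows; in practice, GAN training typically drives updates through a fixed optimizer (e.g., Adam) rather than this explicit scaling, so the function and its base step size are purely illustrative assumptions:

```python
def adjust_parameter(param, gradient, error_value, base_step=0.01):
    # Scale the adjustment with the prediction error value: a larger error
    # produces a slightly larger update, a smaller error a slightly smaller one.
    step = base_step * error_value
    return param - step * gradient

# The same gradient moves the parameter further when the error is larger.
print(adjust_parameter(1.0, 1.0, 2.0))   # larger error
print(adjust_parameter(1.0, 1.0, 0.5))   # smaller error
```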
And S2124, if the prediction error value meets a preset convergence condition, determining that both the initial generator network and the initial discriminator network are trained, and obtaining the generator network and the discriminator network.
It can be understood that, in the training process of the initial generator network and the initial discriminator network, the steps in S211 and S2121-S2123 need to be iterated continuously. After each iteration, it may be determined whether the prediction error value is less than or equal to a preset error threshold, or whether the number of iterations has reached a preset iteration number threshold. If either condition is satisfied, it is determined that the training of both the initial generator network and the initial discriminator network is completed, the current initial generator network is determined to be the generator network, and the current initial discriminator network is determined to be the discriminator network.
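The iterate-until-convergence control flow of S211 and S2121-S2123 can be sketched as below; the geometric error decay is only a stand-in for an actual training round, and all threshold values are illustrative assumptions:

```python
def joint_train(initial_error, decay, error_threshold, max_iters):
    # Iterate until the prediction error value falls to the preset error
    # threshold, or the iteration count reaches the preset iteration threshold.
    error, iters = initial_error, 0
    while error > error_threshold and iters < max_iters:
        error *= decay   # stand-in for one round of S211, S2121-S2123
        iters += 1
    converged = error <= error_threshold or iters >= max_iters
    return error, iters, converged

print(joint_train(1.0, 0.5, 0.1, 100))
```

Either stopping condition alone ends the training loop, mirroring the "error threshold or iteration count" criterion in the text.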
According to the voiceprint recognition method, the confrontation network model can be generated by training after data amplification is carried out on the small sample voice data training set, so that the generated confrontation network model can learn a large amount of knowledge in large sample voice data, network parameters of the generated confrontation network model can be further migrated to the voiceprint recognition model to carry out small sample training, and the generalization capability of the voiceprint recognition model for small sample training is improved.
When the confrontation network model is trained through the voice data, the characteristics of the voice data need to be extracted, so that before training, the voice data can be preprocessed to obtain the Mel frequency spectrogram data (namely, the voice characteristic data). In an embodiment, as shown in fig. 6, the step of preprocessing the small sample speech data training set in S211 to obtain a preprocessing result may specifically include:
s2111, framing the small sample voice data in the small sample voice data training set to obtain a plurality of voice frame data.
Specifically, since voice data is non-stationary overall but approximately stationary over a short time, with no abrupt changes, in order to facilitate processing, the small sample voice data in the small sample voice data training set may be divided into multiple frames of stationary voice data. Optionally, the voice data may be one-dimensional data.
Optionally, the computer device may perform framing on the small sample voice data in the small sample voice data training set according to a time sequence to obtain a plurality of voice frame data.
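The framing step can be sketched as follows, splitting a one-dimensional sample sequence into short-time frames in time order; the frame length and hop length are hypothetical illustrative parameters:

```python
def frame_signal(samples, frame_len, hop_len):
    # Split one-dimensional speech samples into (possibly overlapping)
    # short-time frames, advancing hop_len samples per frame in time order.
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

# 10 samples, frame length 4, hop 2 -> frames at offsets 0, 2, 4, 6.
print(frame_signal(list(range(10)), 4, 2))
```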
S2112, windowing is carried out on the voice frame data respectively to obtain corresponding windowed data.
Specifically, in order that the spectral energy of each voice frame data does not leak during the subsequent Fourier transform, windowing processing may be performed on each voice frame data. Optionally, windowing may be understood as a process of intercepting the voice frame data with a window function. Optionally, the window type may be a rectangular window, a triangular window, a Hanning window, a Gaussian window, or the like.
It should be noted that, because the two ends of the Hamming window are close to zero but not zero, side lobe leakage can be effectively reduced. Therefore, in this embodiment, the computer device may apply a Hamming window to each voice frame data, respectively, to obtain the windowed data corresponding to each voice frame data.
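A minimal sketch of this Hamming windowing step (pure Python, function names hypothetical); with the standard coefficients 0.54 and 0.46, the window's endpoints are 0.08, non-zero as noted above:

```python
import math

def hamming_window(N):
    # w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)); endpoints equal 0.08.
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    # Multiply each sample of a voice frame by the window coefficient.
    w = hamming_window(len(frame))
    return [x * wn for x, wn in zip(frame, w)]

print(hamming_window(5))
```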
S2113, performing Fourier transform on each windowed data, and determining a two-dimensional spectrogram.
Further, the computer device may perform fourier transform on each windowed data, and combine the fourier transform results together according to the time order to obtain the two-dimensional spectrogram. Optionally, the two-dimensional spectrogram may include X-axis data and Y-axis data; the X-axis data may be time and the Y-axis data may be frequency.
S2114, mapping the frequency data in the two-dimensional spectrogram onto a Mel scale to obtain Mel spectrogram data, and determining the Mel spectrogram data as a preprocessing result.
Specifically, the computer device may map the frequency data f of the Y axis in the two-dimensional spectrogram onto the mel scale mel according to a mapping relationship between the frequency data f and the mel scale mel, so as to obtain the mel spectrogram data. Optionally, the mapping relationship may be a proportional relationship, a functional relationship, a logarithmic relationship, an exponential relationship, and/or the like.
In the present embodiment, the mapping relationship between the frequency data f and the mel scale mel can be expressed by the following formula:
mel = 2595 * log10(1 + f/700)    (1);
The numerical values in formula (1) may take other values, which is not limited here.
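As an illustrative sketch, formula (1) can be computed directly; the function name below is hypothetical:

```python
import math

def hz_to_mel(f):
    # mel = 2595 * log10(1 + f / 700), as in formula (1)
    return 2595.0 * math.log10(1.0 + f / 700.0)

# 700 Hz maps to 2595 * log10(2), approximately 781.17 mel.
print(hz_to_mel(700))
```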
The voiceprint recognition method can be used for preprocessing the small sample voice data in the small sample voice data training set to obtain the Mel frequency spectrogram data, then the countermeasure network model is generated through Mel frequency spectrogram data training, so that the optimal network parameters corresponding to the large sample can be obtained through the generated countermeasure network model training, and on the basis, the voiceprint recognition model with high generalization capability can be trained through transfer learning.
As one embodiment, the arbiter network in the generation countermeasure network model includes a first convolution layer and a first full connection layer; the initial voiceprint recognition model comprises a second convolution layer and a second full-connection layer; as shown in fig. 7, the step of migrating the network parameters for generating the confrontation network model into the initial voiceprint recognition model in S220 and training the initial voiceprint recognition model through the small sample speech data training set may be implemented by the following steps:
s221, determining the network parameters of the first convolution layer in the discriminator network as the network parameters of the second convolution layer in the initial voiceprint recognition model, and initializing the network parameters of the second full-connection layer.
In the embodiment, the arbiter network in the generation countermeasure network model includes a convolutional layer (i.e. the first convolutional layer) and a fully-connected layer (i.e. the first fully-connected layer); the initial voiceprint recognition model also includes a convolutional layer (i.e., a second convolutional layer) and a fully-connected layer (i.e., a second fully-connected layer). Wherein, the first convolution layer and the second convolution layer can have the same structure; the first full connection layer and the second full connection layer may have the same or different structures.
It should be noted that the computer device may assign the network parameter for generating the first convolution layer in the countermeasure network model to the network parameter for the second convolution layer in the initial voiceprint recognition model, and at the same time, the computer device may initialize the network parameter for the second fully-connected layer in the initial voiceprint recognition model. If the type output by the voiceprint recognition model is the same as the type output by the discriminator network, the network parameter of the first full connection layer in the discriminator network can be determined as the network parameter of the second full connection layer in the initial voiceprint recognition model, and under the condition, the first full connection layer and the second full connection layer have the same structure; if the output type of the voiceprint recognition model is different from the output type of the discriminator network, the network parameters of the second full connection layer in the initial voiceprint recognition model need to be initialized.
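The parameter migration in S221 can be sketched with plain dictionaries: the trained convolutional parameters are copied across unchanged, while the fully connected layer receives a fresh random initialization. All names and the initialization range are hypothetical:

```python
import random

def migrate_parameters(discriminator_params, fc_init_scale=0.01):
    # Assign the discriminator's first-convolution-layer parameters to the
    # voiceprint model's second convolution layer, and re-initialize the
    # second fully connected layer with small random values.
    return {
        "conv": list(discriminator_params["conv"]),           # transferred as-is
        "fc": [random.uniform(-fc_init_scale, fc_init_scale)
               for _ in discriminator_params["fc"]],          # fresh initialization
    }

trained = {"conv": [0.3, -0.7, 1.2], "fc": [0.9, 0.9]}
print(migrate_parameters(trained))
```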
S222, training the initial voiceprint recognition model through a small sample voice data training set.
Further, the computer device may train the initial voiceprint recognition model through the voice data in the small sample voice data training set. Specifically, the computer device may input all the voice data in the small sample voice data training set into the initial voiceprint recognition model to obtain a voiceprint recognition prediction result, calculate a prediction error value between the voiceprint recognition prediction result and the standard voiceprint recognition result through a loss function, and update the network parameters in the initial voiceprint recognition model according to the prediction error value. This process is repeated, inputting all the voice data in the small sample voice data training set into the initial voiceprint recognition model after each parameter update, until the prediction error value meets a preset error threshold or the number of iterations reaches a preset iteration number threshold, thereby obtaining the pre-trained voiceprint recognition model. The standard voiceprint recognition result may be an idealized voiceprint recognition result, that is, the gold standard for training the voiceprint recognition model.
The voiceprint recognition method can determine the network parameters of a first convolution layer in the discriminator network as the network parameters of a second convolution layer in the initial voiceprint recognition model, initialize the network parameters of a second full connection layer, and train the initial voiceprint recognition model through a small sample voice data training set to obtain a voiceprint recognition result; the method can transfer the network parameters corresponding to the large sample which is generated and learned by the confrontation network model to the voiceprint recognition model to be trained, so that the knowledge obtained by training the large sample voice data set is introduced during training of the voiceprint recognition model to be trained, the convergence rate of the voiceprint recognition model training can be further accelerated, and the accuracy of voiceprint recognition model recognition can be improved.
In order to facilitate understanding of those skilled in the art, the voiceprint recognition method provided by the present application is described by taking an execution subject as a computer device as an example, and specifically, the method includes:
(1) and framing the small sample voice data in the small sample voice data training set to obtain a plurality of voice frame data.
(2) And respectively carrying out windowing processing on the voice frame data to obtain corresponding windowed data.
(3) And performing Fourier transform on each windowed data to determine a two-dimensional spectrogram.
(4) And mapping the frequency data in the two-dimensional spectrogram onto a Mel scale to obtain Mel spectrogram data, and determining the Mel spectrogram data as a preprocessing result.
(5) And inputting the random noise data into an initial generator network to obtain generated data, and inputting the preprocessing result and the generated data into an initial discriminator network to obtain an initial discrimination prediction result.
(6) And calculating a prediction error value between the initial judgment prediction result and the standard judgment result through a loss function.
(7) And updating network parameters in the initial generator network and the initial arbiter network according to the prediction error value.
(8) And if the predicted error value meets the preset convergence condition, determining that the training of the initial generator network and the training of the initial discriminator network are finished, and obtaining the generator network and the discriminator network.
(9) And determining the network parameters of the first convolution layer in the discriminator network as the network parameters of the second convolution layer in the initial voiceprint recognition model, and initializing the network parameters of the second full-connection layer.
(10) Training the initial voiceprint recognition model through a small sample voice data training set to obtain a voiceprint recognition model; the initial voiceprint recognition model and the convolution layer for generating the countermeasure network model have the same structure.
(11) And acquiring voice data to be recognized.
(12) And inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized.
For the implementation processes of (1) to (12), reference may be specifically made to the description of the above embodiments, and the implementation principles and technical effects thereof are similar and will not be described herein again.
It should be understood that although the various steps in the flow charts of fig. 2-7 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-7 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a voiceprint recognition apparatus including: voice data acquisition module 11 and voiceprint recognition module 12, wherein:
the voice data acquisition module 11 is used for acquiring voice data to be recognized;
the voiceprint recognition module 12 is configured to input the voice data to be recognized into a preset voiceprint recognition model, so as to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample voice data training set.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the voiceprint recognition device further comprises: a first model training module and a second model training module, wherein:
the first model training module is used for training the initially generated confrontation network model through a small sample voice data training set to obtain a generated confrontation network model;
the second model training module is used for transferring the network parameters for generating the confrontation network model to the initial voiceprint recognition model, and training the initial voiceprint recognition model through a small sample voice data training set to obtain the voiceprint recognition model; the initial voiceprint recognition model and the convolution layer for generating the countermeasure network model have the same structure.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the initially generated confrontation network model comprises an initial generator network and an initial arbiter network; the first model training module includes: a preprocessing unit and a joint training unit, wherein:
the preprocessing unit is used for preprocessing the small sample voice data training set to obtain a preprocessing result;
and the joint training unit is used for inputting the random noise data into the initial generator network to obtain generated data, inputting the preprocessing result and the generated data into the initial discriminator network, and performing joint training on the initial generator network and the initial discriminator network to obtain the generator network and the discriminator network.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the joint training unit comprises: the device comprises a discriminator network processing subunit, a prediction error value calculating subunit, a network parameter updating subunit and a training end determining subunit, wherein:
the discriminator network processing subunit is used for inputting the preprocessing result and the generated data into an initial discriminator network to obtain an initial discrimination prediction result;
the prediction error value calculation operator unit is used for calculating a prediction error value between the initial judgment prediction result and the standard judgment result through a loss function;
the network parameter updating subunit is used for updating the network parameters in the initial generator network and the initial discriminator network according to the prediction error value;
and the training end determining subunit is used for determining that the training of the initial generator network and the initial discriminator network is finished when the prediction error value meets the preset convergence condition, so as to obtain the generator network and the discriminator network.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the pre-processing unit comprises: a framing subunit, a windowing subunit, a fourier transform subunit, and a data mapping subunit, wherein:
the framing subunit is used for framing the small sample voice data in the small sample voice data training set to obtain a plurality of voice frame data;
the windowing subunit is used for respectively carrying out windowing processing on the voice frame data to obtain corresponding windowed data;
the Fourier transform subunit is used for performing Fourier transform on each windowed data to determine a two-dimensional spectrogram;
and the data mapping subunit is used for mapping the frequency data in the two-dimensional spectrogram onto a Mel scale to obtain Mel spectrogram data, and determining the Mel spectrogram data as a preprocessing result.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the arbiter network in the generation countermeasure network model comprises a first convolutional layer and a first fully connected layer; the initial voiceprint recognition model comprises a second convolution layer and a second full-connection layer; the second model training module comprises: a network parameter initialization unit and a voiceprint recognition model training unit, wherein:
the network parameter initialization unit is used for determining the network parameters of the first convolutional layer in the discriminator network as the network parameters of the second convolutional layer in the initial voiceprint recognition model and initializing the network parameters of the second full-connection layer;
and the voiceprint recognition model training unit is used for training the initial voiceprint recognition model through a small sample voice data training set.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
For the specific definition of the voiceprint recognition device, reference may be made to the above definition of the voiceprint recognition method, which is not described herein again. The modules in the voiceprint recognition apparatus can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the voice data to be recognized. The network interface of the computer device is used for communicating with an external endpoint through a network connection. The computer program is executed by a processor to implement a voiceprint recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample voice data training set.
In one embodiment, a readable storage medium is provided, on which a computer program is stored, which computer program, when executed by a processor, performs the steps of:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample voice data training set.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample voice data training set.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of voiceprint recognition, the method comprising:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample speech data training set.
2. The voiceprint recognition method according to claim 1, wherein the construction process of the voiceprint recognition model comprises:
training an initially generated confrontation network model through the small sample voice data training set to obtain the generated confrontation network model;
migrating the network parameters of the generated confrontation network model to an initial voiceprint recognition model, and training the initial voiceprint recognition model through the small sample voice data training set to obtain the voiceprint recognition model; wherein the initial voiceprint recognition model and the convolution layer generating the countermeasure network model have the same structure.
3. The voiceprint recognition method of claim 2 wherein said initially generated confrontation network model comprises an initial generator network and an initial discriminator network;
then, the training the initially generated confrontation network model through the small sample voice data training set to obtain a generated confrontation network model, including:
preprocessing the small sample voice data training set to obtain a preprocessing result;
and inputting random noise data into the initial generator network to obtain generated data, inputting the preprocessing result and the generated data into the initial discriminator network, and performing joint training on the initial generator network and the initial discriminator network to obtain the generator network and the discriminator network.
4. The voiceprint recognition method according to claim 3, wherein the inputting the preprocessing result and the generated data into the initial discriminator network and jointly training the initial generator network and the initial discriminator network to obtain the generator network and the discriminator network comprises:
inputting the preprocessing result and the generated data into the initial discriminator network to obtain an initial discrimination prediction result;
calculating a prediction error value between the initial discrimination prediction result and a standard discrimination result through a loss function;
updating network parameters of the initial generator network and the initial discriminator network according to the prediction error value;
and if the prediction error value satisfies a preset convergence condition, determining that training of the initial generator network and the initial discriminator network is complete, thereby obtaining the generator network and the discriminator network.
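The joint training procedure of claims 3 and 4 — a generator producing data from random noise, a discriminator scoring real versus generated samples, parameter updates driven by a loss function, and a convergence condition on the error value — can be illustrated with a deliberately tiny numpy adversarial pair. The scalar "networks" on 1-D Gaussian data, learning rate, and stopping tolerance are all illustrative assumptions, not the patent's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy stand-ins: generator g(z) = a*z + b, discriminator d(x) = sigmoid(w*x + c)
a, b = 1.0, 0.0          # initial generator parameters
w, c = 0.1, 0.0          # initial discriminator parameters
lr, tol, prev_loss = 0.05, 1e-4, np.inf

for step in range(2000):
    real = rng.normal(3.0, 1.0, size=64)   # stand-in for preprocessed real samples
    z = rng.normal(size=64)                # random noise input
    fake = a * z + b                       # generated data

    # Discriminator prediction and loss (binary cross-entropy)
    s_r, s_f = sigmoid(w * real + c), sigmoid(w * fake + c)
    d_loss = -np.mean(np.log(s_r + 1e-9) + np.log(1 - s_f + 1e-9))

    # Update discriminator parameters from the loss gradient
    gw = np.mean(-(1 - s_r) * real + s_f * fake)
    gc = np.mean(-(1 - s_r) + s_f)
    w, c = w - lr * gw, c - lr * gc

    # Update generator parameters: push the discriminator to score fakes as real
    s_f = sigmoid(w * fake + c)
    ga = np.mean(-(1 - s_f) * w * z)
    gb = np.mean(-(1 - s_f) * w)
    a, b = a - lr * ga, b - lr * gb

    # Preset convergence condition on the change of the loss value
    if abs(prev_loss - d_loss) < tol:
        break
    prev_loss = d_loss
```

With these settings the generator's offset `b` drifts toward the real-data mean, which is exactly the "generated data fools the discriminator" dynamic the claim describes.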
5. The voiceprint recognition method according to claim 3, wherein the preprocessing the small-sample speech data training set to obtain a preprocessing result comprises:
framing the speech data in the small-sample speech data training set to obtain a plurality of speech frames;
windowing each speech frame to obtain corresponding windowed data;
performing a Fourier transform on each piece of windowed data to obtain a two-dimensional spectrogram;
and mapping the frequency data of the two-dimensional spectrogram onto the Mel scale to obtain Mel spectrogram data, and taking the Mel spectrogram data as the preprocessing result.
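A minimal numpy sketch of the claim-5 preprocessing chain (framing, Hamming windowing, Fourier transform, triangular mel filterbank). The frame length, hop size, and filter count are illustrative assumptions.

```python
import numpy as np

def mel_spectrogram(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    """Framing -> windowing -> |FFT|^2 -> mel filterbank (toy sketch of claim 5)."""
    # 1) framing
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # 2) windowing with a Hamming window
    frames = frames * np.hamming(frame_len)
    # 3) Fourier transform -> two-dimensional (power) spectrogram
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, frame_len//2 + 1)
    # 4) map frequency bins onto the mel scale via triangular filters
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    mel2hz = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = mel2hz(np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((frame_len + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return spec @ fbank.T                                  # (n_frames, n_mels)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
mel = mel_spectrogram(sig)
```

A production pipeline would typically use a library routine (e.g. librosa's melspectrogram) rather than this hand-rolled filterbank.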
6. The voiceprint recognition method according to any one of claims 2 to 5, wherein the discriminator network in the generative adversarial network model comprises a first convolutional layer and a first fully connected layer, and the initial voiceprint recognition model comprises a second convolutional layer and a second fully connected layer;
and the migrating the network parameters of the generative adversarial network model to an initial voiceprint recognition model and training the initial voiceprint recognition model through the small-sample speech data training set comprises:
taking the network parameters of the first convolutional layer in the discriminator network as the network parameters of the second convolutional layer in the initial voiceprint recognition model, and initializing the network parameters of the second fully connected layer;
and training the initial voiceprint recognition model through the small-sample speech data training set.
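The parameter migration of claim 6 — copy the discriminator's convolutional-layer weights into the recognition model and reinitialize the fully connected head — can be sketched with plain parameter dictionaries. All names and shapes are hypothetical; a framework would do this via partial state-dict loading.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter dictionaries; the convolutional shapes match by construction
discriminator = {
    "conv1.weight": rng.normal(size=(16, 1, 3, 3)),
    "conv2.weight": rng.normal(size=(32, 16, 3, 3)),
    "fc.weight":    rng.normal(size=(1, 32)),     # real/fake output head
}
recognizer = {
    "conv1.weight": np.zeros((16, 1, 3, 3)),
    "conv2.weight": np.zeros((32, 16, 3, 3)),
    "fc.weight":    np.zeros((10, 32)),           # speaker-class output head
}

# Migrate convolutional-layer parameters; reinitialize the fully connected layer
for name in recognizer:
    if name.startswith("conv"):
        recognizer[name] = discriminator[name].copy()          # transferred weights
    else:
        recognizer[name] = rng.normal(scale=0.01, size=recognizer[name].shape)
```

After this step the recognition model is fine-tuned on the small-sample training set, with the transferred convolutional weights serving only as the initialization.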
7. A voiceprint recognition apparatus, said apparatus comprising:
the voice data acquisition module is used for acquiring voice data to be recognized;
the voiceprint recognition module is used for inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
wherein initial parameters of the convolutional layers used in training the voiceprint recognition model are determined according to the convolutional-layer parameters of a pre-trained generative adversarial network model, and the generative adversarial network model is obtained by training after data augmentation is performed on a small-sample speech data training set.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the voiceprint recognition method according to any one of claims 1 to 6.
9. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the voiceprint recognition method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1-6.
CN202210450804.6A 2022-04-27 2022-04-27 Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product Pending CN114913860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210450804.6A CN114913860A (en) 2022-04-27 2022-04-27 Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product


Publications (1)

Publication Number Publication Date
CN114913860A true CN114913860A (en) 2022-08-16

Family

ID=82765611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210450804.6A Pending CN114913860A (en) 2022-04-27 2022-04-27 Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN114913860A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727308A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Mixed bird song recognition method based on deep migration learning
CN117727308B (en) * 2024-02-18 2024-04-26 百鸟数据科技(北京)有限责任公司 Mixed bird song recognition method based on deep migration learning

Similar Documents

Publication Publication Date Title
JP7177167B2 (en) Mixed speech identification method, apparatus and computer program
CN110444214B (en) Speech signal processing model training method and device, electronic equipment and storage medium
CN109166586B (en) Speaker identification method and terminal
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
CN109065028B (en) Speaker clustering method, speaker clustering device, computer equipment and storage medium
US20190051292A1 (en) Neural network method and apparatus
WO2019227574A1 (en) Voice model training method, voice recognition method, device and equipment, and medium
CN110310647B (en) Voice identity feature extractor, classifier training method and related equipment
WO2019232851A1 (en) Method and apparatus for training speech differentiation model, and computer device and storage medium
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
WO2019232772A1 (en) Systems and methods for content identification
US20200125836A1 (en) Training Method for Descreening System, Descreening Method, Device, Apparatus and Medium
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN110929836B (en) Neural network training and image processing method and device, electronic equipment and medium
WO2023005386A1 (en) Model training method and apparatus
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
Wei et al. A method of underwater acoustic signal classification based on deep neural network
WO2021042544A1 (en) Facial verification method and apparatus based on mesh removal model, and computer device and storage medium
CN114913860A (en) Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product
CN116257762B (en) Training method of deep learning model and method for controlling mouth shape change of virtual image
KR20220065209A (en) Method and apparatus for recognizing image of various quality
CN113542527B (en) Face image transmission method and device, electronic equipment and storage medium
CN113688655B (en) Method, device, computer equipment and storage medium for identifying interference signals
CN111951791A (en) Voiceprint recognition model training method, recognition method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination