CN112820299A - Voiceprint recognition model training method and device and related equipment - Google Patents


Info

Publication number: CN112820299A (application number CN202011594311.7A)
Authority: CN (China)
Prior art keywords: voiceprint, sample data, recognition model, training, voiceprint recognition
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112820299B (granted publication)
Inventors: 陈燕丽, 王洪斌, 蒋宁, 吴海英
Original and current assignee: Mashang Consumer Finance Co Ltd
Events: application filed by Mashang Consumer Finance Co Ltd; priority to CN202011594311.7A; publication of CN112820299A; application granted; publication of CN112820299B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building


Abstract

The application provides a voiceprint recognition model training method, a voiceprint recognition model training device and related equipment, where the method includes the following steps: randomly selecting M first voiceprint sample data from a sample pool, where each first voiceprint sample datum carries a sampled probability value; inputting the first voiceprint sample data into a pre-training voiceprint recognition model and performing the Nth iteration of training; adjusting the sampled probability values of the first voiceprint sample data based on the classification results output by the pre-training voiceprint recognition model; and, in the case that the pre-training voiceprint recognition model after the Nth iterative training has converged, determining the pre-training voiceprint recognition model after the Nth iterative training as the voiceprint recognition model. The first voiceprint sample data with adjusted sampled probability values are used to determine the input data of the (N+1)th iterative training, where M and N are positive integers. In this way, the training accuracy of the trained voiceprint recognition model can be improved.

Description

Voiceprint recognition model training method and device and related equipment
Technical Field
The present application relates to the field of voiceprint recognition technologies, and in particular, to a method and an apparatus for training a voiceprint recognition model, and a related device.
Background
Voiceprint recognition, as a trusted voiceprint-based authentication technology, has wide application prospects in fields and scenarios such as identity authentication and security verification. However, speech is easily affected by external conditions such as noisy environments and by the speaker's own factors such as emotion and physical state, so improving the accuracy of voiceprint recognition is of great practical significance. In the course of implementing the present application, the applicant found the following technical problem in the prior art: during training, a voiceprint recognition model is easily disturbed by large-scale unbalanced data, so the trained voiceprint recognition model has low training accuracy.
Disclosure of Invention
The embodiments of the application provide a voiceprint recognition model training method, a voiceprint recognition model training device and related equipment, aiming to solve the problem that the training accuracy of a voiceprint recognition model obtained through training is low.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a method for training a voiceprint recognition model, including:
randomly selecting M first voiceprint sample data in a sample pool, wherein each first voiceprint sample data comprises a sampled probability value;
inputting the first voiceprint sample data into a pre-training voiceprint recognition model, and performing the Nth iteration of training;
adjusting the sampled probability value of the first voiceprint sample data based on a classification result output by the pre-training voiceprint recognition model;
under the condition that the pre-training voiceprint recognition model after the Nth iterative training is converged, determining the pre-training voiceprint recognition model after the Nth iterative training as a voiceprint recognition model;
and the first voiceprint sample data after the sampled probability value is adjusted is used for determining input data of the (N+1)th iterative training, wherein M and N are positive integers.
In a second aspect, an embodiment of the present application provides a voiceprint recognition method, including:
acquiring voiceprint data corresponding to first voice data to be recognized;
inputting the voiceprint data into a voiceprint recognition model to obtain a voiceprint feature vector to be confirmed;
inputting the voiceprint feature vector into a preset classification model to obtain a first classification result;
determining that the first voice data is voice data of a first user under the condition that the first classification result is matched with a reference result corresponding to the first user;
and the voiceprint recognition model is obtained by training based on the voiceprint recognition model training method.
In a third aspect, an embodiment of the present application provides a training apparatus for a voiceprint recognition model, including:
the selecting module is used for randomly selecting M first voiceprint sample data in a sample pool, wherein each first voiceprint sample data comprises a sampled probability value;
the training module is used for inputting the first voiceprint sample data into a pre-training voiceprint recognition model and performing the Nth iteration of training;
a first determining module, configured to adjust the sampled probability value of the first voiceprint sample data based on a classification result output by the pre-training voiceprint recognition model;
the second determining module is used for determining the pre-training voiceprint recognition model after the Nth iterative training as the voiceprint recognition model under the condition that the pre-training voiceprint recognition model after the Nth iterative training is converged;
and the first voiceprint sample data after the sampled probability value is adjusted is used for determining input data of the (N+1)th iterative training, wherein M and N are positive integers.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the above voiceprint recognition model training method.
In a fifth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above voiceprint recognition model training method.
In the embodiments of the application, M first voiceprint sample data are randomly selected from a sample pool, where each first voiceprint sample datum carries a sampled probability value; the first voiceprint sample data are input into the pre-training voiceprint recognition model for the Nth iteration of training; the sampled probability values of the first voiceprint sample data are adjusted based on the classification results output by the pre-training voiceprint recognition model; and, in the case that the pre-training voiceprint recognition model after the Nth iterative training has converged, it is determined as the voiceprint recognition model. The first voiceprint sample data with adjusted sampled probability values are used to determine the input data of the (N+1)th iterative training, where M and N are positive integers. Thus, by adjusting the sampled probability values of the first voiceprint sample data and determining the next iteration's input data accordingly, interference from large-scale unbalanced data is reduced, and the training accuracy of the voiceprint recognition model obtained through training is improved.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a method for training a voiceprint recognition model according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of another training method for a voiceprint recognition model provided by an embodiment of the present application;
FIG. 3 is a flowchart of another training method for a voiceprint recognition model according to an embodiment of the present application;
FIG. 4 is a flowchart of another training method for a voiceprint recognition model provided by an embodiment of the present application;
FIG. 5 is a flowchart of another training method for a voiceprint recognition model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of input data and output data of a voiceprint recognition model provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a voiceprint recognition model provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another voiceprint recognition model provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of another voiceprint recognition model provided in the embodiment of the present application;
FIG. 10 is a schematic structural diagram of another voiceprint recognition model provided by an embodiment of the present application;
fig. 11 is a flowchart of a voiceprint recognition method provided by an embodiment of the application;
fig. 12 is a schematic structural diagram of a training apparatus for a voiceprint recognition model according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a voiceprint recognition apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a voiceprint recognition model training method provided in an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
step 101, randomly selecting M first voiceprint sample data in a sample pool, wherein each first voiceprint sample data comprises a sampled probability value.
All the voiceprint sample data to be trained in the sample pool can be referred to as full data, and the M first voiceprint sample data belong to partial data in the sample pool.
In addition, the M first voiceprint sample data belong to different classes; that is, they may be at least some of the valuable samples found adaptively, and these adaptively found valuable samples form a small batch of data. This sampling scheme may be referred to as Adaptive Data Sampling (ADS).
Each first voiceprint sample datum carries a sampled probability value; that is, each first voiceprint sample datum corresponds to a probability of being sampled. Storing this value with each sample makes it convenient to later identify the sampled probability value corresponding to each first voiceprint sample datum and adjust it.
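To make the sampling step concrete, here is a minimal Python sketch (an illustration, not the patent's implementation); it assumes each pool entry is a dict holding its features together with a 'sampled_prob' field, and it draws with replacement for simplicity.

```python
import random

def select_minibatch(sample_pool, M):
    """Randomly draw M first voiceprint samples, biased by each sample's
    stored sampled-probability value ('sampled_prob')."""
    weights = [s["sampled_prob"] for s in sample_pool]
    # random.choices draws with replacement; weights need not sum to 1.
    return random.choices(sample_pool, weights=weights, k=M)
```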
Step 102, inputting the first voiceprint sample data into a pre-training voiceprint recognition model, and performing the Nth iteration of training.
The M first voiceprint sample data are converted into audio features in matrix form, and the audio features are input into the pre-training voiceprint recognition model for iterative training.
As an optional implementation manner, the pre-trained voiceprint recognition model is a voiceprint recognition model after being trained by using full data, and the first voiceprint sample data belongs to data in the full data.
In this embodiment, the pre-training voiceprint recognition model obtained by training on the full data can serve as a good baseline voiceprint recognition model: it has already learned the basics of voiceprint recognition and has a certain degree of recognition ability for voice, i.e., it can perform an initial recognition of voiceprints. Training on the full data is therefore equivalent to pre-training the model, which shortens the training period of the subsequent training of the pre-training voiceprint recognition model.
For example: referring to fig. 3, fig. 3 is a flowchart illustrating training of a voiceprint recognition model by using full data to obtain a pre-trained voiceprint recognition model in the present embodiment, it can be understood that the voiceprint recognition model in the training of the voiceprint recognition model in fig. 3 refers to an initial voiceprint recognition model, and after the training by using full data, the voiceprint recognition model converged in fig. 3 is the pre-trained voiceprint recognition model in the present embodiment.
The training mode of the pre-training voiceprint recognition model obtained by training with the full-scale data can be described as follows:
batches are randomly selected from the full data to train the initial voiceprint recognition model; once all data have been used, one epoch is complete. Several epochs are run in this way until the training and validation accuracy stabilizes, and the trained model is then determined as the pre-training voiceprint recognition model of this embodiment.
That is to say, the full data can be divided into multiple groups of data, each containing several samples; one group is called a batch, and one training pass over all groups is called an epoch. The groups are fed into the initial voiceprint recognition model in turn until the training and validation accuracy stabilizes.
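This batch/epoch loop can be pictured with the following Python sketch; train_step and validate are assumed hooks on the model object, and the stability test on validation accuracy is a simplification of "accuracy tends to be stable".

```python
import random

def pretrain(model, full_data, batch_size, max_epochs, tol=1e-3):
    """Train on the full data in randomly ordered batches; one pass over
    all batches is one epoch. Stop when validation accuracy stabilises."""
    prev_acc = float("-inf")
    for _ in range(max_epochs):
        random.shuffle(full_data)
        for start in range(0, len(full_data), batch_size):
            model.train_step(full_data[start:start + batch_size])  # assumed hook
        acc = model.validate()  # assumed hook returning validation accuracy
        if abs(acc - prev_acc) < tol:  # accuracy has stabilised
            break
        prev_acc = acc
    return model
```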
Step 103, adjusting the sampled probability value of the first voiceprint sample data based on the classification result output by the pre-training voiceprint recognition model.
It should be noted that the sampled probability value of each first voiceprint sample datum cannot fall below a minimum threshold. This ensures that every first voiceprint sample datum can still be acquired, avoiding the situation where the sampled probability value of some first voiceprint sample data becomes so low that they are never acquired and thus can no longer be input into the pre-training voiceprint recognition model for training.
Referring to fig. 2 and 4, the probability value of the first voiceprint sample data being sampled may be referred to as a weight, and the first voiceprint sample data may be referred to as data for short, and the acquired data is input into a voiceprint recognition model (i.e., a pre-trained voiceprint recognition model in the present embodiment) for training.
The pre-training voiceprint recognition model outputs a classification result for each first voiceprint sample datum, and the corresponding sampled probability value is determined according to that result. The classification results correspond one-to-one with the input samples: each first voiceprint sample datum input into the pre-training voiceprint recognition model yields its own classification result, which is usually one of two outcomes, a correct classification or an incorrect one.
In addition, the pre-training voiceprint recognition model may output the probability that the first voiceprint sample datum belongs to a given class; when this probability is greater than or equal to a preset value, the classification result is determined to be correct, and when it is smaller than the preset value, the result is determined to be wrong. For example, with a preset value of 0.5, an output probability of 0.6 gives a correct classification result, while an output probability of 0.4 gives a wrong one.
Of course, the pre-training voiceprint recognition model may instead directly output whether the first voiceprint sample datum belongs to a given class. For example, when judging whether the sample belongs to the first class, a classification result of "yes" indicates that it does, and "no" indicates that it does not.
In addition, the pre-trained voiceprint recognition model may further output a plurality of probability values, and determine, from the plurality of probability values, a classification corresponding to the maximum value as the classification of the first voiceprint sample data, for example: the classification result includes a probability value of 0.2 belonging to the first class, a probability value of 0.3 belonging to the second class, and a probability value of 0.5 belonging to the third class, so that it can be determined that the third class corresponding to 0.5 is the class of the first voiceprint sample data.
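As a hedged illustration of how such outputs yield a classification result, the sketch below (names are assumptions, not from the patent) picks the argmax class from a list of per-class probabilities and applies the preset-value test described above.

```python
def classification_result(probs, preset=0.5):
    """probs: per-class probabilities output by the model for one sample.
    Returns the argmax class and whether its probability reaches the
    preset value used to deem the classification correct."""
    predicted = max(range(len(probs)), key=lambda i: probs[i])  # argmax class
    return predicted, probs[predicted] >= preset

# e.g. classification_result([0.2, 0.3, 0.5]) -> (2, True): the third class.
```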
It should be noted that, a specific manner of determining the sampled probability value of the corresponding first voiceprint sample data according to the classification result is not limited herein.
For example, as an optional implementation, the adjusting of the sampled probability value of the first voiceprint sample data based on the classification result output by the pre-training voiceprint recognition model includes:
under the condition that the classification result output by the pre-training voiceprint recognition model is correct, reducing the sampled probability value of the first voiceprint sample data;
and under the condition that the classification result output by the pre-training voiceprint recognition model is wrong, increasing the probability value of the first voiceprint sample data which is sampled, or rejecting the first voiceprint sample data.
When the sampled probability value of the first voiceprint sample data is adjusted downward or upward, a fixed amount can be applied each time; it should be noted that the fixed amounts corresponding to different first voiceprint sample data may be the same or different.
Alternatively, a variable amount may be applied each time, whose size is related to the number of times the corresponding first voiceprint sample datum has been input into the pre-training voiceprint recognition model; that is, the more times it has been input, the larger (or smaller) the adjustment. The specific mapping is not limited here.
In this embodiment, when the classification result output by the pre-training voiceprint recognition model is correct, the model can already recognize the first voiceprint sample datum correctly, so its sampled probability value can be reduced in the subsequent training process. When the classification result is wrong, the model cannot yet recognize the first voiceprint sample datum correctly, so its sampled probability value can be increased in the subsequent training process; the model then acquires and trains on this first voiceprint sample datum more often, improving its accuracy in recognizing it.
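A minimal sketch of this adjustment rule, assuming a fixed step DELTA and a floor MIN_PROB standing in for the lowest threshold mentioned above (both values are illustrative):

```python
DELTA = 0.05     # assumed fixed adjustment step
MIN_PROB = 0.01  # assumed lowest threshold: every sample stays drawable

def adjust_sampled_prob(sample, classified_correctly):
    """Lower the sampled probability of samples the model already
    classifies correctly; raise it for misclassified (hard) samples so
    that later iterations draw them more often."""
    if classified_correctly:
        sample["sampled_prob"] = max(MIN_PROB, sample["sampled_prob"] - DELTA)
    else:
        sample["sampled_prob"] = min(1.0, sample["sampled_prob"] + DELTA)
```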
As an optional implementation manner, in the case that the classification result output by the pre-trained voiceprint recognition model is incorrect, the increasing the probability value of the sampled first voiceprint sample data, or rejecting the first voiceprint sample data includes:
determining a target parameter of the first voiceprint sample data under the condition that a classification result output by the pre-training voiceprint recognition model is wrong;
under the condition that the target parameter meets a preset condition, increasing the sampled probability value of the first voiceprint sample data;
or, under the condition that the target parameter does not meet the preset condition, the first voiceprint sample data is rejected.
In this way, when the target parameter meets the preset condition, the first voiceprint sample datum corresponding to it is a good sample that has nevertheless been misclassified. It therefore needs to continue to be acquired in subsequent training: its sampled probability value is increased, and it is input into the pre-training voiceprint recognition model for training so that the model comes to classify it correctly. Correspondingly, when the target parameter does not meet the preset condition, the corresponding first voiceprint sample datum is a poor sample that is misclassified, so its acquisition probability in subsequent training should be reduced, and it can be rejected.
The specific type of the target parameter is not limited herein, and for example: as an optional implementation manner, the target parameter may be a sampled probability value of the first voiceprint sample data, and when a difference value between the sampled probability value of the first voiceprint sample data and a preset probability value is greater than or equal to a first preset value, it is determined that the target parameter meets a preset condition; and under the condition that the difference value of the sampled probability value of the first voiceprint sample data and the preset probability value is smaller than a first preset value, determining that the target parameter does not meet the preset condition.
The specific values of the first preset value are not limited herein, for example: the first preset value may be 0.
It should be noted that the preset probability value may be an average value or a weighted average value of probability values corresponding to a plurality of voiceprint sample data in the sample pool, where the plurality of voiceprint sample data may be the voiceprint sample data with a correct labeling classification result, or may be labeled as the sample data belonging to the same classification as the first voiceprint sample data.
As another optional implementation, the target parameter may be a sampled probability value of the first voiceprint sample data, and when the sampled probability value of the first voiceprint sample data is greater than or equal to a second preset value, it is determined that the target parameter meets a preset condition; and under the condition that the sampled probability value of the first voiceprint sample data is smaller than a second preset value, determining that the target parameter does not meet the preset condition.
As another optional implementation, the target parameter may be the probability value that the first voiceprint sample datum belongs to a certain class after being input into the pre-training voiceprint recognition model. When this probability value is greater than or equal to a third preset value, the target parameter is determined to meet the preset condition; when it is smaller than the third preset value, the target parameter is determined not to meet the preset condition.
Referring to fig. 4, the sampled probability value of the first voiceprint sample data corresponds to the score in fig. 4, the second preset value corresponds to the given threshold in fig. 4, and whether the iteration is correct refers to whether the classification result in this embodiment is correct.
As an optional implementation manner, in the case that the classification result output by the pre-trained voiceprint recognition model is incorrect, determining the target parameter of the first voiceprint sample data includes:
determining the similarity between the first voiceprint sample data and corresponding preset voiceprint sample data as the target parameter under the condition that the classification result output by the pre-training voiceprint recognition model is wrong;
the increasing the probability value of the first voiceprint sample data under the condition that the target parameter meets a preset condition includes:
and under the condition that the similarity between the first voiceprint sample data and the corresponding preset voiceprint sample data is greater than a first threshold value, increasing the sampled probability value of the first voiceprint sample data.
The corresponding preset voiceprint sample data can be regarded as voiceprint sample data with labeling information, and the labeling information can be used for representing classification information of the voiceprint sample data, namely the preset voiceprint sample data is the voiceprint sample data with correct classification.
In this embodiment, when the classification result of the first voiceprint sample datum output by the pre-training voiceprint recognition model is wrong, the similarity between the first voiceprint sample datum and the corresponding preset voiceprint sample datum is judged. When the similarity is greater than the first threshold, the first voiceprint sample datum contains a large amount of valid data, where valid data may refer to the portion of data the first voiceprint sample datum shares with the preset voiceprint sample datum; this indicates the first voiceprint sample datum is a valuable sample. Its sampled probability value is therefore increased, so that in the subsequent training process the model acquires it again and trains on it until the model can classify it correctly, i.e., recognize it correctly.
The higher the similarity, the larger the proportion of valid data in the first voiceprint sample datum.
As another optional implementation manner, the rejecting the first voiceprint sample data when the target parameter does not meet a preset condition includes:
and under the condition that the classification result output by the pre-training voiceprint recognition model is wrong and the similarity between the first voiceprint sample data and the corresponding preset voiceprint sample data is less than or equal to a first threshold value, rejecting the first voiceprint sample data.
In this embodiment, since the classification result output by the pre-training voiceprint recognition model is wrong and the similarity between the first voiceprint sample datum and the corresponding preset voiceprint sample datum is less than or equal to the first threshold, the first voiceprint sample datum contains little valid data. It can therefore be determined to be invalid data, or data that need not be learned, and is rejected. This reduces the probability of the model acquiring it again in the subsequent training process, reduces the amount of sample data that can be drawn from the whole sample pool, and thus reduces the occupation of computing resources.
It should be noted that the valid data may refer to corresponding expressions in the previous embodiments, and details are not described herein again.
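Putting the two branches together, here is a hedged sketch of the similarity-based decision; cosine similarity is only one plausible metric (the patent does not fix one), and the 'features' field and reference vector are assumed representations.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def handle_misclassified(sample, reference_vec, first_threshold, delta=0.05):
    """On a wrong classification: up-weight a sample that still resembles
    its labelled (preset) counterpart; reject it otherwise."""
    if cosine_similarity(sample["features"], reference_vec) > first_threshold:
        sample["sampled_prob"] = min(1.0, sample["sampled_prob"] + delta)
        return "kept"
    sample["sampled_prob"] = 0.0  # rejected: never drawn again
    return "rejected"
```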
Step 104, in the case that the pre-training voiceprint recognition model after the Nth iterative training has converged, determining the pre-training voiceprint recognition model after the Nth iterative training as the voiceprint recognition model; the first voiceprint sample data after the sampled probability value is adjusted are used for determining the input data of the (N+1)th iterative training, where M and N are positive integers.
The way of judging whether the pre-training voiceprint recognition model has converged is not limited here. For example: the error produced during training is smaller than a small preset value; the change in weight values between two training iterations is very small, e.g. a threshold is set and training stops when the change falls below it; or a maximum number of training iterations is set and training stops once that number is exceeded.
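These stopping rules can be combined as in the sketch below; the thresholds eps, weight_tol, and max_iters are assumed illustrative values, not ones given by the application.

```python
def has_converged(loss, prev_weights, weights, iteration,
                  eps=1e-4, weight_tol=1e-5, max_iters=10000):
    """True if any stopping rule fires: training error below a small
    preset value, weight change between two iterations below a threshold,
    or the maximum number of training iterations reached."""
    weight_delta = max(abs(w - p) for w, p in zip(weights, prev_weights))
    return loss < eps or weight_delta < weight_tol or iteration >= max_iters
```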
It should be noted that after the voiceprint recognition model is determined, the voiceprint recognition model may be further trained to further improve the recognition accuracy of the voiceprint recognition model.
Optionally, after determining the pre-trained voiceprint recognition model after the nth iteration training as the voiceprint recognition model, the method further includes:
and randomly acquiring L voiceprint sample data sets, wherein each voiceprint sample data set comprises voiceprint sample data of at least two users, the similarity of the voiceprint sample data of the at least two users is greater than a second threshold, and L is a positive integer.
Training the voiceprint recognition model by utilizing the L voiceprint sample data sets;
and under the condition that the trained voiceprint recognition model is converged, determining the trained voiceprint recognition model as a target voiceprint recognition model.
The similarity between the voiceprint sample data included in any two voiceprint sample data sets is greater than a third threshold. That is to say, the voiceprint sample data in any two sets are highly similar and therefore hard to discriminate; such data can be acquired for training the voiceprint recognition model, further improving its recognition precision and accuracy.
The similarity of the voiceprint sample data of at least two users in each voiceprint sample data set is greater than the second threshold, so that the similarity of the voiceprint sample data in each voiceprint sample data set is higher.
In this way, in the present embodiment, the recognition accuracy and precision of the trained target voiceprint recognition model can be further improved.
Optionally, the voiceprint sample data of the at least two users includes voiceprint sample data of a target user, other users except the target user belong to users in the queue of the target user, and the similarity between the voiceprint sample data of the users in the queue of the target user and the voiceprint sample data of the target user is greater than a second threshold.
In this way, the queue users similar to each target user can be placed in that target user's queue. When voiceprint sample data are selected, several target users are determined at random, then several target queue users are determined from each target user's queue, and the voiceprint sample data of each target user and each target queue user are selected. This makes the selection of voiceprint sample data faster while keeping the selected data highly similar, which helps improve the recognition precision and accuracy of the voiceprint recognition model.
As an optional implementation manner for selecting voiceprint sample data, the randomly acquiring L voiceprint sample data sets includes:
randomly determining S target users;
determining K target queue users in a queue corresponding to each target user;
and randomly selecting I pieces of voiceprint data from each target user and from each corresponding target queue user to form a voiceprint sample data set.
Therefore, the target users are determined first, and then the target queue user corresponding to each target user is determined, so that the speed of selecting the voiceprint sample data is higher, and meanwhile, the similarity of the selected voiceprint sample data is higher.
Referring to fig. 2 and 5, S target users may be determined first, and then K target queue users in the queue similar to each target user; for all S × K users, I pieces of voiceprint sample data are randomly selected per user and input into the voiceprint recognition model for training.
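A sketch of this set construction under assumed data structures (queues maps each target user to the users in their similar-user queue, voiceprints maps each user to their voiceprint samples):

```python
import random

def build_hard_sets(users, queues, voiceprints, S, K, I):
    """Pick S target users at random, K users from each target's
    similar-user queue, and I voiceprint samples per selected user,
    forming sets of highly similar (hard to discriminate) samples."""
    sets = []
    for target in random.sample(users, S):
        group = [target] + random.sample(queues[target], K)
        sets.append([vp for user in group
                        for vp in random.sample(voiceprints[user], I)])
    return sets
```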
It should be noted that, as an optional implementation, the voiceprint sample data included in the L voiceprint sample data sets belong to the full data; that is, they may be a part of the full data, and may also be referred to as small-batch data. Since the full data come from the sample pool, only the full data need to be determined, and the L voiceprint sample data sets are then drawn from them without searching any other data, which reduces the workload and the occupation of computing resources.
In the embodiments of the application, the sampled probability values of the first voiceprint sample data are adjusted according to the classification results output by the pre-training voiceprint recognition model, establishing a first feedback mechanism: in later training the model pays more attention to voiceprint sample data with incorrect classification results, i.e., it learns from such valuable samples. In addition, voiceprint sample data sets are determined from the voiceprint sample data of target users and their corresponding target queue users, and the voiceprint recognition model is trained on these sets, so that the model pays more attention to highly similar sample data during training; the acquired sets can be updated after each round of training, establishing a second feedback mechanism.
It can also be understood as follows: the application provides two voiceprint recognition model training methods based on hard mining. The first establishes a feedback mechanism at the data-sampling level, so that the model pays more attention to hard samples later in training and thus learns from valuable samples. The second performs hard mining at the class level and likewise establishes a feedback mechanism, so that the model focuses on similar classes during training. Hard-mining-based voiceprint recognition training can effectively improve recognition accuracy and shorten training time.
It should be noted that the pre-training voiceprint recognition model and the voiceprint recognition model in the embodiments of the application may both adopt a ResNet structure. The dynamic routing layer included in the ResNet structure not only allows deeper networks to be trained, but also greatly reduces the number of network parameters, improving network performance and effectively improving network efficiency. The network design can be tailored to the service scenario. Voiceprint features extracted based on ResNet (also referred to as voiceprint recognition data) are called r-vectors. A ResNet structure has two basic dimensions that control model capacity, namely width and depth.
For example: referring to fig. 7, fig. 7 may correspond to a structural schematic diagram of ResNet, which may also be referred to as a Basic block, including at least two layers of 3X3 convolutional layers. In addition, referring to fig. 6, fig. 6 is a system architecture diagram of a voiceprint recognition model, i.e., fig. 6 may be a schematic diagram of input data and output data. In fig. 6, a module may be referred to as a module, an output size may be referred to as an output size or output size of a model (i.e., may be understood as a probability value), an input layer may be referred to as an input layer, an aggregation layer may be referred to as an aggregation layer, an average may be referred to as an average, an attribute may be referred to as an attention, a classification layer may be referred to as a classification layer, and conv may be used to represent a vector convolution operation fc and may represent a connection layer.
Of course, as another optional implementation manner, the pre-training voiceprint recognition model and the voiceprint recognition model each include a first convolutional layer, a second convolutional layer, and a third convolutional layer, where the second convolutional layer is located between the first convolutional layer and the third convolutional layer and is a ResNeXt structure or a Res2Net structure.
The step of acquiring voiceprint sample data in the embodiment of the present application may be implemented on the data layer, and other steps in addition may be implemented on the first convolution layer, the second convolution layer, and the third convolution layer.
The first convolutional layer may be a 3x3 or 1x1 structure, and the third convolutional layer may likewise be 3x3 or 1x1. For example: referring to fig. 8, fig. 8 includes a first convolutional layer 801, a second convolutional layer 802, and a third convolutional layer 803, where the first convolutional layer 801 and the third convolutional layer 803 are both 1x1 convolutional layers; the second convolutional layer 802 may be a 3x3 convolutional layer organized into C groups (a grouped convolution), i.e., the second convolutional layer 802 in fig. 8 can be understood as a ResNeXt structure.
Referring to fig. 9, fig. 9 includes a first convolutional layer 901, a second convolutional layer 902, and a third convolutional layer 903, where the first convolutional layer 901 is a 3X3 convolutional layer, the third convolutional layer 903 is a 1X1 convolutional layer, and the second convolutional layer 902 can be understood as a Res2Net structure.
It will be appreciated that ResNeXt is equivalent to replacing the residual block in ResNet with a multi-branch transformation, thereby introducing many convolution groups within one layer. Res2Net likewise redesigns the residual block: it constructs hierarchical residual-like connections inside the residual block and provides multiple available receptive fields within one layer. In addition to width and depth, ResNeXt and Res2Net expose two additional dimensions, called cardinality and scale, respectively.
Referring specifically to fig. 8, the second convolutional layer 802 in fig. 8 is a multi-branch transformation with a hyperparameter C, referred to as the cardinality. The cardinality is a new basic dimension: increasing it improves training efficiency more effectively than increasing the depth or width of the network, and reduces the occurrence of overfitting.
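In PyTorch terms (a sketch, not the patent's code), a multi-branch transformation with cardinality C reduces to a grouped 3x3 convolution:

```python
import torch.nn as nn

def resnext_branch_layer(channels: int, cardinality: int = 32) -> nn.Conv2d:
    """3x3 convolution split into `cardinality` parallel channel groups;
    channels must be divisible by cardinality. C=32 is an assumed,
    commonly used value, not one fixed by the application."""
    return nn.Conv2d(channels, channels, kernel_size=3,
                     padding=1, groups=cardinality, bias=False)
```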
Referring to fig. 9, the Res2Net architecture can represent multi-scale features at a finer level of granularity and uses a multi-scale approach to increase the receptive field of each layer. In the ResNet-based voiceprint recognition system, the basic block of the ResNet structure is modified into a Res2Net block. For example: the first convolutional layer in fig. 8 becomes the first convolutional layer 901 in fig. 9, a 3x3 convolutional layer, and the second convolutional layer 802 in fig. 8 is replaced with a Res2Net module.
For example, fig. 10 illustrates the design of the Res2Net module. After the first 3x3 convolutional layer, the feature map is evenly sliced into s subsets, denoted {x1, x2, ..., xs}. Each subset except x1 is then fed to a 3x3 convolution, denoted Ki. Starting from x3, the output of Ki-1 is added to xi before xi passes through Ki. This hierarchical residual-like connection further increases the possible receptive fields within one layer, so that more comprehensive voiceprint features can be obtained. Meanwhile, the multi-scale feature representation of Res2Net can greatly improve performance on short utterances and is more robust to background noise.
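The slicing-and-summation scheme just described can be sketched as a PyTorch module; this is an illustration of the stated design, with batch normalization and activations omitted for brevity.

```python
import torch
import torch.nn as nn

class Res2NetModule(nn.Module):
    """Hierarchical residual-like connections inside one block: the
    feature map is sliced evenly into `scale` subsets x1..xs; x1 passes
    through, x2 gets a 3x3 conv K2, and from x3 on each xi is summed
    with the previous output before its own 3x3 conv Ki."""
    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0, "channels must divide evenly by scale"
        width = channels // scale
        self.scale = scale
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scale - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xs = torch.chunk(x, self.scale, dim=1)  # slice into s subsets
        ys = [xs[0]]                            # y1 = x1 (passed through)
        y = self.convs[0](xs[1])                # y2 = K2(x2)
        ys.append(y)
        for i in range(2, self.scale):
            y = self.convs[i - 1](xs[i] + y)    # yi = Ki(xi + y_{i-1})
            ys.append(y)
        return torch.cat(ys, dim=1)             # concatenate along channels
```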
The method and the device improve the ResNet-based voiceprint recognition system by integrating the ResNeXt and Res2Net structures into the existing ResNet structure. ResNeXt contributes a multi-branch transformation with a hyperparameter C (the cardinality); increasing the cardinality can effectively improve the accuracy of voiceprint recognition. The hierarchical residual-like connection of Res2Net further increases the possible receptive fields within one layer, so that more comprehensive voiceprint features can be obtained; meanwhile, its multi-scale feature representation greatly improves performance on short utterances and strengthens robustness to background noise. The recognition accuracy of the r-vector voiceprint recognition model is thereby further improved.
Thus, in the embodiments of the application, the core ideas of ResNeXt and Res2Net are introduced into the traditional ResNet architecture, so that features are utilized to the greatest extent and better performance is obtained with fewer parameters.
Referring to fig. 11, an embodiment of the present application further provides a voiceprint recognition method, including the following steps:
Step 1101, acquiring voiceprint data corresponding to first voice data to be recognized;
the first voice data may be collected voice data of a user, or may also be obtained voice data in a sample pool.
Step 1102, inputting the voiceprint data into a voiceprint recognition model to obtain a voiceprint feature vector to be confirmed;
the voiceprint feature vector to be confirmed can be in a matrix form, and the voiceprint recognition model is obtained by training based on the voiceprint recognition model training method provided by the embodiment.
Step 1103, inputting the voiceprint feature vector into a preset classification model to obtain a first classification result;
the type of classification model is not limited herein, and for example: the classification model may be in the form of a Softmax classifier or the like.
Step 1104, determining that the first voice data is the voice data of the first user when the first classification result matches the reference result corresponding to the first user.
The reference result corresponding to the first user may be a correct classification result annotated in advance.
Therefore, the classification result of the first voice data to be recognized can be more accurate through the steps.
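End to end, the recognition flow of steps 1101-1104 can be sketched as follows; every callable here (the voiceprint extractor, model, and classifier) is an assumed interface standing in for the components described above.

```python
def recognize_first_user(voice_data, extract_voiceprint, vpr_model,
                         classifier, reference_result):
    """voice data -> voiceprint data -> feature vector -> classification
    result, matched against the first user's reference result."""
    voiceprint = extract_voiceprint(voice_data)   # step 1101
    feature_vec = vpr_model(voiceprint)           # step 1102
    first_result = classifier(feature_vec)        # step 1103 (e.g. Softmax)
    return first_result == reference_result       # step 1104
```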
Referring to fig. 12, fig. 12 is a structural diagram of a training apparatus for a voiceprint recognition model according to an embodiment of the present application, which can implement details of a training method for a voiceprint recognition model in the foregoing embodiment and achieve the same effect. As shown in fig. 12, the voiceprint recognition model training apparatus 1200 includes:
a selecting module 1201, configured to randomly select M first voiceprint sample data in a sample pool, where each first voiceprint sample data includes a sampled probability value;
a training module 1202, configured to input the first voiceprint sample data into a pre-training voiceprint recognition model and perform the Nth iteration of training;
a first determining module 1203, configured to adjust the sampled probability value of the first voiceprint sample data based on a classification result output by the pre-training voiceprint recognition model;
a second determining module 1204, configured to determine the pre-trained voiceprint recognition model after the nth iterative training as the voiceprint recognition model under the condition that the pre-trained voiceprint recognition model after the nth iterative training is converged;
and the first voiceprint sample data after the sampled probability value is adjusted is used for determining input data of the (N+1)th iterative training, wherein M and N are positive integers.
Optionally, the first determining module 1203 includes:
the reduction sub-module is used for reducing the sampled probability value of the first voiceprint sample data under the condition that the classification result output by the pre-training voiceprint recognition model is correct;
and the adjusting submodule is used for increasing the sampled probability value of the first voiceprint sample data or rejecting the first voiceprint sample data under the condition that the classification result output by the pre-training voiceprint recognition model is wrong.
Optionally, the adjusting sub-module comprises:
the determining subunit is configured to determine a target parameter of the first voiceprint sample data when the classification result output by the pre-training voiceprint recognition model is incorrect;
the heightening subunit is used for heightening the sampled probability value of the first voiceprint sample data under the condition that the target parameter meets a preset condition;
or, the rejecting subunit is configured to reject the first voiceprint sample data when the target parameter does not meet a preset condition.
Optionally, the training apparatus 1200 for voiceprint recognition model further comprises:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for randomly acquiring L voiceprint sample data sets, each voiceprint sample data set comprises voiceprint sample data of at least two users, the similarity of the voiceprint sample data of the at least two users is greater than a second threshold, and L is a positive integer.
The training module is used for training the voiceprint recognition model by utilizing the L voiceprint sample data sets;
and the third determining module is used for determining the trained voiceprint recognition model as the target voiceprint recognition model under the condition that the trained voiceprint recognition model is converged.
Optionally, the voiceprint sample data of the at least two users includes voiceprint sample data of a target user, other users except the target user belong to users in the queue of the target user, and the similarity between the voiceprint sample data of the users in the queue of the target user and the voiceprint sample data of the target user is greater than a second threshold.
Optionally, the pre-trained voiceprint recognition model is a voiceprint recognition model trained by using full data, and the voiceprint sample data included in the first voiceprint sample data and the voiceprint sample data included in the L voiceprint sample data sets both belong to data in the full data.
Optionally, the pre-trained voiceprint recognition model and the voiceprint recognition model each include a first convolutional layer, a second convolutional layer, and a third convolutional layer, the second convolutional layer is located between the first convolutional layer and the third convolutional layer, and the second convolutional layer is a ResNeXt structure or a Res2Net structure.
The voiceprint recognition model training device provided by the embodiment of the application can realize each process realized by the voiceprint recognition model training device in the method embodiment of fig. 1, and is not repeated here for avoiding repetition.
Optionally, referring to fig. 13, an embodiment of the present application further provides a schematic structural diagram of a voiceprint recognition apparatus, as shown in fig. 13, a voiceprint recognition apparatus 1300 includes:
a second obtaining module 1301, configured to obtain voiceprint data corresponding to the first voice data to be recognized;
a first input module 1302, configured to input the voiceprint data into a voiceprint recognition model, so as to obtain a voiceprint feature vector to be confirmed;
the second input module 1303 is configured to input the voiceprint feature vector to a preset classification model to obtain a first classification result;
a fourth determining module 1304, configured to determine that the first voice data is voice data of the first user when the first classification result matches a reference result corresponding to the first user;
wherein the voiceprint recognition model is trained based on the voiceprint recognition model training method of any one of claims 1 to 7.
Therefore, the voiceprint recognition device can enable the classification result of the first voice data to be recognized to be accurate.
Fig. 14 is a schematic hardware structure diagram of an electronic device implementing various embodiments of the present application.
The electronic device 1400 includes, but is not limited to: radio frequency unit 1401, network module 1402, audio output unit 1403, input unit 1404, sensor 1405, display unit 1406, user input unit 1407, interface unit 1408, memory 1409, processor 1410, and power supply 1411. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 14 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present application, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
Wherein, the processor 1410 is configured to perform the following operations:
randomly selecting M first voiceprint sample data in a sample pool, wherein each first voiceprint sample data comprises a sampled probability value;
inputting the first voiceprint sample data into a pre-training voiceprint recognition model, and performing the Nth iteration of training;
adjusting the sampled probability value of the first voiceprint sample data based on a classification result output by the pre-training voiceprint recognition model;
under the condition that the pre-training voiceprint recognition model after the Nth iterative training is converged, determining the pre-training voiceprint recognition model after the Nth iterative training as a voiceprint recognition model;
and the first voiceprint sample data after the sampled probability value is adjusted is used for determining input data of the (N+1)th iterative training, wherein M and N are positive integers.
It should be understood that, in the embodiment of the present application, the radio frequency unit 1401 may be configured to receive and transmit signals during message transmission or a call; specifically, it receives downlink data from a base station and forwards it to the processor 1410 for processing, and transmits uplink data to the base station. In general, the radio frequency unit 1401 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. The radio frequency unit 1401 may also communicate with a network and other devices via a wireless communication system.
The electronic device provides wireless broadband internet access to the user through the network module 1402, such as helping the user send and receive e-mails, browse webpages, access streaming media, and the like.
The audio output unit 1403 can convert audio data received by the radio frequency unit 1401 or the network module 1402, or stored in the memory 1409, into an audio signal and output it as sound. Moreover, the audio output unit 1403 may also provide audio output related to a specific function performed by the electronic device 1400 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 1403 includes a speaker, a buzzer, a receiver, and the like.
The input unit 1404 is used to receive audio or video signals. The input unit 1404 may include a graphics processing unit (GPU) 14041 and a microphone 14042. The graphics processor 14041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 1406. The image frames processed by the graphics processor 14041 may be stored in the memory 1409 (or another storage medium) or transmitted via the radio frequency unit 1401 or the network module 1402. The microphone 14042 may receive sound and process it into audio data. In a phone call mode, the processed audio data may be converted into a format transmittable to a mobile communication base station and output via the radio frequency unit 1401.
The electronic device 1400 also includes at least one sensor 1405, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor, which can adjust the brightness of the display panel 14061 according to the brightness of ambient light, and a proximity sensor, which can turn off the display panel 14061 and/or the backlight when the electronic device 1400 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes) and detect the magnitude and direction of gravity when stationary; it can be used to identify the orientation of the electronic device (such as switching between landscape and portrait modes, related games, and magnetometer posture calibration) and for vibration-recognition-related functions (such as a pedometer or tapping detection). The sensor 1405 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which are not described in detail herein.
The display unit 1406 is used to display information input by the user or information provided to the user. The display unit 1406 may include a display panel 14061, and the display panel 14061 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The user input unit 1407 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 1407 includes a touch panel 14071 and other input devices 14072. The touch panel 14071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations performed on or near the touch panel 14071 using a finger, a stylus, or any other suitable object or attachment). The touch panel 14071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 1410, and receives and executes commands from the processor 1410. In addition, the touch panel 14071 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types. Besides the touch panel 14071, the other input devices 14072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described herein.
Further, the touch panel 14071 may be overlaid on the display panel 14061. When the touch panel 14071 detects a touch operation on or near it, the operation is transmitted to the processor 1410 to determine the type of the touch event, and the processor 1410 then provides a corresponding visual output on the display panel 14061 according to the type of the touch event. Although in Fig. 14 the touch panel 14071 and the display panel 14061 are shown as two independent components implementing the input and output functions of the electronic device, in some embodiments the touch panel 14071 and the display panel 14061 may be integrated to implement the input and output functions of the electronic device; this is not limited herein.
The interface unit 1408 is an interface for connecting an external device to the electronic device 1400. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 1408 may be used to receive input from an external device (e.g., data information, power, etc.) and transmit the received input to one or more elements within the electronic device 1400, or may be used to transmit data between the electronic device 1400 and the external device.
The memory 1409 may be used to store software programs as well as various data. The memory 1409 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the device (such as audio data or a phonebook), and the like. In addition, the memory 1409 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 1410 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 1409 and calling data stored in the memory 1409, thereby performing overall monitoring of the electronic device. Processor 1410 may include one or more processing units; preferably, the processor 1410 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 1410.
The electronic device 1400 may further include a power source 1411 (e.g., a battery) for supplying power to various components, and preferably, the power source 1411 may be logically connected to the processor 1410 via a power management system, so as to implement functions of managing charging, discharging, and power consumption via the power management system.
In addition, the electronic device 1400 includes some functional modules that are not shown, which are not described herein.
Preferably, an embodiment of the present application further provides an electronic device, including a processor 1410, a memory 1409, and a computer program stored in the memory 1409 and executable on the processor 1410. When executed by the processor 1410, the computer program implements the processes of the above voiceprint recognition model training method embodiment and can achieve the same technical effect, which is not described herein again.
The embodiment of the present application further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the processes of the above voiceprint recognition model training method embodiment and can achieve the same technical effect; to avoid repetition, details are not described herein again. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A voiceprint recognition model training method is characterized by comprising the following steps:
randomly selecting M first voiceprint sample data in a sample pool, wherein each first voiceprint sample data comprises a sampled probability value;
inputting the first voiceprint sample data into a pre-training voiceprint recognition model, and performing the Nth iterative training;
adjusting the sampled probability value of the first voiceprint sample data based on a classification result output by the pre-training voiceprint recognition model;
under the condition that the pre-training voiceprint recognition model after the Nth iterative training has converged, determining the pre-training voiceprint recognition model after the Nth iterative training as the voiceprint recognition model;
and the first voiceprint sample data with the adjusted sampled probability value is used to determine the input data of the (N+1)th iterative training, wherein M and N are positive integers.
2. The method of claim 1, wherein the adjusting the sampled probability value of the first voiceprint sample data based on the classification result output by the pre-trained voiceprint recognition model comprises:
under the condition that the classification result output by the pre-training voiceprint recognition model is correct, reducing the sampled probability value of the first voiceprint sample data;
and under the condition that the classification result output by the pre-training voiceprint recognition model is wrong, increasing the sampled probability value of the first voiceprint sample data, or rejecting the first voiceprint sample data.
3. The method according to claim 2, wherein, in a case that the classification result output by the pre-trained voiceprint recognition model is wrong, the increasing the sampled probability value of the first voiceprint sample data or rejecting the first voiceprint sample data comprises:
determining a target parameter of the first voiceprint sample data under the condition that a classification result output by the pre-training voiceprint recognition model is wrong;
under the condition that the target parameter meets a preset condition, increasing the sampled probability value of the first voiceprint sample data;
or, under the condition that the target parameter does not meet the preset condition, the first voiceprint sample data is rejected.
4. The method of claim 2, wherein after determining the pre-trained voiceprint recognition model after the Nth iterative training as the voiceprint recognition model, the method further comprises:
randomly acquiring L voiceprint sample data sets, wherein each voiceprint sample data set comprises voiceprint sample data of at least two users, the similarity of the voiceprint sample data of the at least two users is greater than a second threshold value, and L is a positive integer;
training the voiceprint recognition model by utilizing the L voiceprint sample data sets;
and under the condition that the trained voiceprint recognition model has converged, determining the trained voiceprint recognition model as a target voiceprint recognition model.
5. The method according to claim 4, wherein the voiceprint sample data of the at least two users comprises voiceprint sample data of a target user, wherein other users except the target user belong to users in the queue of the target user, and the similarity between the voiceprint sample data of the users in the queue of the target user and the voiceprint sample data of the target user is greater than a second threshold.
6. The method according to claim 4, wherein the pre-trained voiceprint recognition model is a voiceprint recognition model trained using the full data, and the first voiceprint sample data and the L voiceprint sample data sets both belong to the full data.
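To illustrate claims 4 to 6 above, the sketch below shows one plausible way to assemble the L hard sample sets from users whose voiceprints are mutually similar (the target user and the users in the target user's queue). The cosine-similarity measure, the threshold value, the data layout, and the function name are assumptions introduced for the example, not details fixed by the claims.

```python
import numpy as np

def build_hard_sets(embeddings, L, second_threshold=0.7, rng=None):
    """Assemble L hard sample sets: for each randomly chosen target user,
    collect the users whose voiceprint embeddings exceed the second
    threshold in similarity (the target user's 'queue').
    embeddings maps user id -> averaged voiceprint vector."""
    rng = rng or np.random.default_rng()
    user_ids = list(embeddings)
    hard_sets = []
    for _ in range(L):
        target = rng.choice(user_ids)
        t_vec = embeddings[target]
        queue = []
        for uid in user_ids:
            if uid == target:
                continue
            v = embeddings[uid]
            # Cosine similarity between the candidate and the target user.
            cos = np.dot(t_vec, v) / (np.linalg.norm(t_vec) * np.linalg.norm(v))
            if cos > second_threshold:
                queue.append(uid)
        if queue:  # each set needs voiceprint data of at least two users
            hard_sets.append([target] + queue)
    return hard_sets
```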
7. A voiceprint recognition method, comprising:
acquiring voiceprint data corresponding to first voice data to be recognized;
inputting the voiceprint data into a voiceprint recognition model to obtain a voiceprint feature vector to be confirmed;
inputting the voiceprint feature vector into a preset classification model to obtain a first classification result;
determining that the first voice data is voice data of a first user under the condition that the first classification result is matched with a reference result corresponding to the first user;
wherein the voiceprint recognition model is trained based on the voiceprint recognition model training method of any one of claims 1 to 6.
8. A voiceprint recognition model training device, comprising:
the selecting module is used for randomly selecting M first voiceprint sample data in a sample pool, wherein each first voiceprint sample data comprises a sampled probability value;
the training module is used for inputting the first voiceprint sample data into a pre-training voiceprint recognition model and performing the Nth iterative training;
a first determining module, configured to adjust the sampled probability value of the first voiceprint sample data based on a classification result output by the pre-training voiceprint recognition model;
the second determining module is used for determining the pre-training voiceprint recognition model after the Nth iterative training as the voiceprint recognition model under the condition that the pre-training voiceprint recognition model after the Nth iterative training has converged;
and the first voiceprint sample data with the adjusted sampled probability value is used to determine the input data of the (N+1)th iterative training, wherein M and N are positive integers.
9. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps in the voiceprint recognition model training method of any one of claims 1 to 6.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for training a voiceprint recognition model according to any one of claims 1 to 6.
CN202011594311.7A 2020-12-29 2020-12-29 Voiceprint recognition model training method and device and related equipment Active CN112820299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011594311.7A CN112820299B (en) 2020-12-29 2020-12-29 Voiceprint recognition model training method and device and related equipment

Publications (2)

Publication Number Publication Date
CN112820299A true CN112820299A (en) 2021-05-18
CN112820299B CN112820299B (en) 2021-09-14

Family

ID=75854964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011594311.7A Active CN112820299B (en) 2020-12-29 2020-12-29 Voiceprint recognition model training method and device and related equipment

Country Status (1)

Country Link
CN (1) CN112820299B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10109279B1 (en) * 2009-05-29 2018-10-23 Darrell Poirier Large vocabulary binary speech recognition
CN102810311A (en) * 2011-06-01 2012-12-05 株式会社理光 Speaker estimation method and speaker estimation equipment
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
CN106611604A (en) * 2015-10-23 2017-05-03 中国科学院声学研究所 An automatic voice summation tone detection method based on a deep neural network
US20180277146A1 (en) * 2016-03-21 2018-09-27 Sonde Health, Inc. System and method for anhedonia measurement using acoustic and contextual cues
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106098068A (en) * 2016-06-12 2016-11-09 腾讯科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN108735202A (en) * 2017-03-13 2018-11-02 百度(美国)有限责任公司 Convolution recurrent neural network for small occupancy resource keyword retrieval
CN108777146A (en) * 2018-05-31 2018-11-09 平安科技(深圳)有限公司 Speech model training method, method for distinguishing speek person, device, equipment and medium
WO2020128542A1 (en) * 2018-12-18 2020-06-25 Szegedi Tudományegyetem Automatic detection of neurocognitive impairment based on a speech sample
CN111275206A (en) * 2020-01-19 2020-06-12 同济大学 Integrated learning method based on heuristic sampling
CN111785283A (en) * 2020-05-18 2020-10-16 北京三快在线科技有限公司 Voiceprint recognition model training method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHI-JING XU et al.: "Parkinson's Disease Detection Based on Spectrogram-Deep Convolutional Generative Adversarial Network Sample Augmentation", IEEE Access *
HU Qing: "Research on the Application of Convolutional Neural Networks in Voiceprint Recognition", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421573A (en) * 2021-06-18 2021-09-21 马上消费金融股份有限公司 Identity recognition model training method, identity recognition method and device
CN113421573B (en) * 2021-06-18 2024-03-19 马上消费金融股份有限公司 Identity recognition model training method, identity recognition method and device
CN114113837A (en) * 2021-11-15 2022-03-01 国网辽宁省电力有限公司朝阳供电公司 Acoustic feature-based transformer live-line detection method and system
CN114113837B (en) * 2021-11-15 2024-04-30 国网辽宁省电力有限公司朝阳供电公司 Transformer live detection method and system based on acoustic characteristics
CN113823294A (en) * 2021-11-23 2021-12-21 清华大学 Cross-channel voiceprint recognition method, device, equipment and storage medium
CN114049900A (en) * 2021-12-08 2022-02-15 马上消费金融股份有限公司 Model training method, identity recognition method and device and electronic equipment
CN114049900B (en) * 2021-12-08 2023-07-25 马上消费金融股份有限公司 Model training method, identity recognition device and electronic equipment
CN117094383A (en) * 2023-10-19 2023-11-21 成都数之联科技股份有限公司 Joint training method, system, equipment and storage medium for language model
CN117094383B (en) * 2023-10-19 2024-02-02 成都数之联科技股份有限公司 Joint training method, system, equipment and storage medium for language model

Also Published As

Publication number Publication date
CN112820299B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN112820299B (en) Voiceprint recognition model training method and device and related equipment
CN110009052B (en) Image recognition method, image recognition model training method and device
CN111260665B (en) Image segmentation model training method and device
CN110364144B (en) Speech recognition model training method and device
CN109558512B (en) Audio-based personalized recommendation method and device and mobile terminal
CN108427873B (en) Biological feature identification method and mobile terminal
CN111209423B (en) Image management method and device based on electronic album and storage medium
CN111402866A (en) Semantic recognition method and device and electronic equipment
CN114722937B (en) Abnormal data detection method and device, electronic equipment and storage medium
CN113268572A (en) Question answering method and device
CN108765522B (en) Dynamic image generation method and mobile terminal
CN113723159A (en) Scene recognition model training method, scene recognition method and model training device
WO2022227507A1 (en) Wake-up degree recognition model training method and speech wake-up degree acquisition method
CN110597957B (en) Text information retrieval method and related device
CN112464831B (en) Video classification method, training method of video classification model and related equipment
CN110674294A (en) Similarity determination method and electronic equipment
CN116259083A (en) Image quality recognition model determining method and related device
CN113870862A (en) Voiceprint recognition model training method, voiceprint recognition method and related equipment
CN109976610B (en) Application program identifier classification method and terminal equipment
CN113569043A (en) Text category determination method and related device
CN113536876A (en) Image recognition method and related device
CN114743024A (en) Image identification method, device and system and electronic equipment
CN114120044B (en) Image classification method, image classification network training method, device and electronic equipment
CN117011649B (en) Model training method and related device
CN110942085B (en) Image classification method, image classification device and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant