CN113409795B - Training method, voiceprint recognition method and device and electronic equipment - Google Patents

Training method, voiceprint recognition method and device and electronic equipment

Info

Publication number
CN113409795B
CN113409795B (application number CN202110955765.0A)
Authority
CN
China
Prior art keywords
voiceprint
class
audio data
features
voiceprint features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110955765.0A
Other languages
Chinese (zh)
Other versions
CN113409795A (en
Inventor
周到
贺刚
陈昌滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110955765.0A priority Critical patent/CN113409795B/en
Publication of CN113409795A publication Critical patent/CN113409795A/en
Application granted granted Critical
Publication of CN113409795B publication Critical patent/CN113409795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches

Abstract

The disclosure provides a training method, a voiceprint recognition method and apparatus, and an electronic device. The training method comprises: acquiring a sample audio data set, and training a voiceprint recognition model based on the sample audio data set. The sample audio data set comprises a plurality of sample audio data, and the voiceprint recognition model is used for determining the voiceprint feature corresponding to each of the plurality of sample audio data. For different voiceprint features, the loss function of the voiceprint recognition model contains different margins, and each margin is associated with the distance from the voiceprint feature to the center of the class to which the voiceprint feature belongs. By introducing different margins into the loss function for different voiceprint features, the method improves both the training speed and the recognition accuracy of the voiceprint recognition model.

Description

Training method, voiceprint recognition method and device and electronic equipment
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a training method, a voiceprint recognition method and apparatus, and an electronic device.
Background
Voiceprint recognition is a biometric technique that can identify a speaker's identity from the speaker's voice. Because voiceprint recognition is secure and convenient, it is widely applied in fields such as security, smart home, banking, and judicial practice.
At present, most voiceprint recognition systems are built on neural networks and can distinguish audio uttered by different speakers.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a training method including:
obtaining a sample audio data set, wherein the sample audio data set comprises a plurality of sample audio data;
training a voiceprint recognition model based on the sample audio data set, wherein the voiceprint recognition model is used for determining a voiceprint feature corresponding to each sample audio data in the plurality of sample audio data, and the loss function of the voiceprint recognition model contains different margins for different voiceprint features, and the margins are associated with the distances from the voiceprint features to the class centers of the classes to which the voiceprint features belong.
According to another aspect of the present disclosure, there is provided a voiceprint recognition method including:
acquiring audio data to be identified;
extracting the voiceprint characteristics of the audio data to be recognized by utilizing a voiceprint recognition model, wherein the voiceprint recognition model is obtained by training according to the training method;
determining a speaker identity of the audio data to be recognized based on the voiceprint features.
According to another aspect of the present disclosure, there is provided a training apparatus comprising:
an obtaining module configured to obtain a sample audio data set, wherein the sample audio data set includes a plurality of sample audio data;
a training module configured to train a voiceprint recognition model based on the sample audio data set, where the voiceprint recognition model is configured to determine a voiceprint feature corresponding to each sample audio data in the plurality of sample audio data, and where, for different voiceprint features, a loss function of the voiceprint recognition model contains different margins, and the margins are associated with distances from the voiceprint features to class centers of classes to which the voiceprint features belong.
According to another aspect of the present disclosure, there is provided a voiceprint recognition apparatus including:
the acquisition module is used for acquiring audio data to be identified;
the extraction module is used for extracting the voiceprint characteristics of the audio data to be recognized by utilizing a voiceprint recognition model, wherein the voiceprint recognition model is obtained by training according to the training method;
and the determining module is used for determining the speaker identity of the audio data to be recognized based on the voiceprint characteristics.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform a training method according to an embodiment of the disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform a voiceprint recognition method according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform a training method according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium, characterized in that the non-transitory computer readable storage medium stores computer instructions for causing the computer to execute a voiceprint recognition method according to an embodiment of the present disclosure.
In one or more technical solutions provided in the embodiments of the present application, for different voiceprint features, a loss function of a voiceprint recognition model includes different margins, and since the margins may be regarded as forces that push voiceprint features to a class center, when the margins are associated with distances from the voiceprint features to the class center of the class to which the voiceprint features belong (for short, distances of the voiceprint features), an influence of the distances of the voiceprint features on the margins may be introduced into the loss function. Based on this, when the voiceprint recognition model is trained based on the sample audio data set, the model parameters of the voiceprint recognition model can be updated more quickly by using the loss function, and the trained voiceprint recognition model can accurately determine the identity of the speaker.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of a system illustrating a training method provided in accordance with an exemplary embodiment of the present disclosure;
fig. 2 is a schematic diagram of a system architecture illustrating a voiceprint recognition method provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a schematic flow chart diagram of a training method provided by an exemplary embodiment of the present disclosure;
FIG. 4 shows a schematic flow chart of a voiceprint recognition method of an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic interface diagram presented by a user device of an exemplary embodiment of the present disclosure;
FIG. 6 shows a schematic block diagram of functional modules of a training apparatus according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a schematic block diagram of functional modules of a voiceprint recognition apparatus according to an exemplary embodiment of the present disclosure;
FIG. 8 shows a schematic block diagram of a chip according to an example embodiment of the present disclosure;
FIG. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information. Before describing the embodiments of the present application, the terms related to the embodiments of the present application will be explained as follows:
Voiceprint recognition, also called speaker recognition, is one of the biometric identification technologies; it converts an audio signal into an electrical signal and then performs recognition using a computer.
A neural network is a computational model that imitates the behavioral characteristics of biological neural networks and performs distributed parallel information processing.
A fully-connected layer acts as a classifier: it maps the high-dimensional feature map obtained by feature extraction into a one-dimensional feature vector. The one-dimensional feature vector aggregates the feature information and can be converted into probabilities over the final classification categories.
The loss function is used to measure the difference between predicted values and true values. The predictions of a neural network are determined by the weights ω and biases b; since the predicted value and the true value are not always completely consistent, their error can be expressed as a loss function L(ω, b) = f(H_{ω,b}(x), y).
The Softmax loss function is a classical loss function commonly used in neural-network-based classification tasks.
The back-propagation algorithm optimizes the network parameters of a neural network by gradient descent: the value of the loss function is computed from the network output and the expected value, the partial derivatives of the loss function with respect to the model parameters are then computed, and finally the network parameters are updated.
The model parameters include a weight parameter representing a slope of the hyperplane and a bias parameter representing an intercept of the hyperplane.
Exemplary embodiments of the present disclosure provide a training method in which a voiceprint recognition model is trained with an optimized loss function so that voiceprint features can be recognized accurately.
Fig. 1 shows a schematic diagram of a system for illustrating a training method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the system 100 provided by the exemplary embodiment of the present disclosure is a deep learning processor, which includes a control module 110, a storage module 120, and an operation module 130. The control module 110 is used for controlling the operation module 130 and the storage module 120 to work, and completing a deep learning task; the operation module 130 performs the computation task of deep learning, and the storage module 120 is used for storing or transporting related data.
As shown in fig. 1, the control module 110 includes an instruction fetch unit 111 and an instruction decode unit 112. The instruction fetch unit 111 is used for fetching instructions from an off-chip memory (e.g., DRAM), and the instruction decode unit 112 decodes the instructions and sends them to the operation module 130 and the storage module 120 for execution.
As shown in fig. 1, the operation module 130 includes a vector operation unit 131 and a matrix operation unit 132. The vector operation unit 131 performs vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; it is suitable for multiple operation modes and data types, and supports preprocessing of input neurons and post-processing of output neurons, such as table lookup, pooling, edge expansion, vector comparison, vector minimization, and data format conversion. The matrix operation unit 132 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution, and can realize the operations of convolutional layers and fully-connected layers. The operations of the matrix operation unit 132 account for more than 90% of the whole neural network algorithm, so its operators can adopt low bit widths to reduce chip area and power consumption; the multiplier can be designed using a parallel multiplier scheme and supports three operation modes, INT16 × INT16, INT8 × INT8, and INT8 × INT4. In addition, since neural networks are sparse, the matrix operation unit 132 also supports sparsity processing for weights or neurons that are 0, which eliminates a large amount of operation energy overhead.
As shown in fig. 1, the storage module 120 includes a direct memory access unit 121, a neuron storage unit 122, and a weight storage unit 123. The direct memory access unit 121 coordinates data interaction between the neuron storage unit 122, the weight storage unit 123, and off-chip storage. The neuron storage unit 122 stores data such as input neurons, output neurons, and intermediate results of the deep learning network, and the weight storage unit 123 stores the weights of the deep learning network. The direct memory access unit 121 is connected to the off-chip memory, the neuron storage unit 122, and the weight storage unit 123 through a storage bus and is responsible for data transfer between the off-chip memory and the neuron storage unit 122 and the weight storage unit 123.
As shown in fig. 1, when the deep learning processor starts a deep learning operation, the instruction fetch unit 111 reads a program instruction from the off-chip memory through the direct memory access unit 121, and the instruction is decoded by the instruction decode unit 112 and distributed to the direct memory access unit 121, the vector operation unit 131, and the matrix operation unit 132. After receiving its instruction, the direct memory access unit 121 sends access requests to the off-chip memory through the storage bus, reading the neuron data stored in the off-chip memory into the neuron storage unit 122 and the weights stored in the off-chip memory into the weight storage unit 123. After receiving its instruction, the vector operation unit 131 reads the neuron data in the neuron storage unit 122, preprocesses it (for example, boundary expansion), and sends the processed neuron data to the matrix operation unit 132. After receiving its instruction, the matrix operation unit 132 takes the preprocessed neuron data from the vector operation unit 131, reads the weight data from the weight storage unit 123, and sends the result to the vector operation unit 131 after completing the matrix operation. The vector operation unit 131 performs post-processing, such as activation or pooling, on the output neurons and stores the results in the neuron storage unit 122. Finally, the direct memory access unit 121 writes the output neurons from the neuron storage unit 122 back to the off-chip memory.
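To make this sequence concrete, the following is a toy Python sketch that mirrors the dataflow just described for a single fully-connected layer; the dictionary standing in for off-chip memory and the simple preprocessing and post-processing steps are illustrative placeholders, not the processor's actual interfaces.

    import numpy as np

    def run_layer(off_chip):
        """Toy walk-through of the Fig. 1 dataflow for one fully-connected layer."""
        # Direct memory access unit: load input neurons and weights from
        # off-chip memory into the neuron and weight storage units.
        neurons = off_chip["neurons"]            # (batch, in_dim)
        weights = off_chip["weights"]            # (in_dim, out_dim)

        # Vector operation unit: preprocess the input neurons
        # (here simply a data format conversion).
        neurons = neurons.astype(np.float32)

        # Matrix operation unit: the core matrix multiplication.
        raw = neurons @ weights

        # Vector operation unit: post-process the output neurons (ReLU activation).
        out = np.maximum(raw, 0.0)

        # Direct memory access unit: write the output neurons back to off-chip memory.
        off_chip["neurons"] = out
        return out

    # Usage: off-chip memory holding a batch of 2 inputs and a 4 x 3 weight matrix.
    memory = {"neurons": np.ones((2, 4)), "weights": np.ones((4, 3))}
    run_layer(memory)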
Fig. 2 is a schematic diagram of a system architecture illustrating a voiceprint recognition method according to an exemplary embodiment of the present disclosure. As shown in fig. 2, a system architecture 200 provided by an exemplary embodiment of the present disclosure includes: user device 210, execution device 220, and data storage system 230.
As shown in fig. 2, the user device 210 may communicate with the execution device 220 through a communication network. The communication network may be a wired communication network or a wireless communication network. The wired communication network may be, for example, a network based on power line carrier technology, and the wireless communication network may be a local area wireless network or a wide area wireless network. The local area wireless network can be, for example, a WIFI or Zigbee network, and the wide area wireless network can be a mobile communication network or a satellite communication network.
As shown in fig. 2, the user device 210 may be an intelligent terminal such as a computer, a mobile phone, or an information processing center, and the user device 210 may serve as the initiator of voiceprint recognition and send a request to the execution device 220. The execution device 220 may be a server having a data processing function, such as a cloud server, a web server, an application server, or a management server, to implement the voiceprint recognition method. The server may be configured with a deep learning processor, which may be a single-core deep learning processor (DLP-S) or a multi-core deep learning processor (DLP-M). The DLP-M is a multi-core extension of the DLP-S: a plurality of DLP-S cores are interconnected through a network-on-chip (NoC), which carries inter-core communication such as multicast and inter-core synchronization.
As shown in fig. 2, the data storage system 230 is a general term and includes a database that stores historical data locally; it may reside on the execution device 220 or on another network server. The data storage system 230 may be separate from the execution device 220 or may be integrated within it. The data storage system 230 may not only store data uploaded by the user device 210, but may also store program instructions, neuron data, weight data, and the like, which may be trained data. In addition, processing results obtained by the execution device 220 (such as a pre-processed voiceprint feature to be recognized, an intermediate processing result, or a voiceprint recognition result) may also be stored in the data storage system 230.
In practical applications, as shown in fig. 2, the user device 210 may have an audio capture function, so that the user device 210 may not only initiate a request to the execution device 220 through the interactive interface, but also capture the audio data to be recognized and send it to the execution device 220 through the interactive interface. Based on this, when the execution device 220 implements the voiceprint recognition method, the audio data to be recognized can be obtained not only from the data storage system 230 but also from the user device 210 through the interactive interface. In addition, when the execution device 220 implements the voiceprint recognition method, the voiceprint recognition result may be not only fed back to the user device 210 through the communication network, but also stored in the data storage system 230.
In the related art, the structure of the neural network is usually designed according to the characteristics of the input features in the frequency domain and the time domain, so as to extract as many speaker-related features as possible from the input features and eliminate interference information unrelated to the speaker. For the extracted voiceprint features to express the speaker's identity and to be distinguishable from the voiceprint features of other speakers, the neural network must learn voiceprint features with good discriminability from the audio data. For example, the loss function may be optimized so that the back-propagation algorithm updates the model parameters of the voiceprint recognition model accordingly. Based on this, whether the loss function is suitable directly affects how well the voiceprint recognition model can discriminate voiceprint features.
At present, the loss functions commonly used by voiceprint recognition models are losses based on metric learning and variants of the Softmax loss function. The former usually requires mining hard samples, which makes the performance of the voiceprint recognition model unstable and training very time-consuming; the latter is easier to train and performs no worse than the former. The Softmax loss function can be optimized by introducing an additive margin. The L-Softmax loss function shown in Formula one is described below as an example.
L = -\frac{1}{N}\sum_{n=1}^{N}\log\frac{e^{W_{y_n}^{T}x_n+b_{y_n}}}{\sum_{m=1}^{C}e^{W_{m}^{T}x_n+b_{m}}}

Formula one
In Formula one, for the voiceprint recognition model, N represents the number of sample audio data in the sample audio data set, C represents the number of classes of voiceprint features determined by the voiceprint recognition model based on the plurality of sample audio data, y_n denotes the number of the class to which the n-th voiceprint feature belongs, and x_n denotes the n-th voiceprint feature, which can be regarded as the output of the penultimate layer (i.e., the fully-connected layer above the classification layer) of the voiceprint recognition model. W_yn denotes the weight vector corresponding to the voiceprint features belonging to class y_n and can be regarded as the class center of the n-th voiceprint feature; it is a weight vector assigned to the computation of the loss function. θ_yn denotes the angle between x_n and W_yn, and b_yn denotes the bias corresponding to the voiceprint features of class y_n. W_m denotes the weight vector corresponding to the voiceprint features of class m and can be regarded as the class center of the voiceprint features of class m; it is likewise a weight vector assigned to the computation of the loss function. θ_n,m is the angle between x_n and W_m, b_m denotes the bias corresponding to the voiceprint features of class m, and T is the transpose symbol.
In the related art, by introducing a limiting condition into the L-Softmax loss function shown in Formula one, it can be converted into the A-Softmax loss function L_AM shown in Formula two.
L_{AM} = -\frac{1}{N}\sum_{n=1}^{N}\log\frac{e^{\lVert x_n\rVert\,\psi(\theta_{y_n})}}{e^{\lVert x_n\rVert\,\psi(\theta_{y_n})}+\sum_{m\neq y_n}e^{\lVert x_n\rVert\,\psi(\theta_{n,m})}}

Formula two
In Formula two, ψ(θ_yn) denotes a function of θ_yn, and ψ(θ_n,m) denotes a function of θ_n,m; for the other parameters, refer to the explanations above.
Then, by introducing an additive margin, Formula two can be transformed into the additive margin loss function L_0 shown in Formula three.
L_0 = -\frac{1}{N}\sum_{n=1}^{N}\log\frac{e^{s(\cos\theta_{y_n}-\lambda)}}{e^{s(\cos\theta_{y_n}-\lambda)}+\sum_{m\neq y_n}e^{s\cos\theta_{n,m}}}

Formula three
In Formula three, s is a hyper-parameter that may be a constant, such as 30, but is not limited thereto, and λ is the margin, which may be a constant; for the other parameters, refer to the explanations above.
From the above, the additive margin loss function shown in Formula three can be obtained from the L-Softmax loss function. This additive margin loss function uses the same constant margin for all voiceprint features.
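To make Formula three concrete, the following is a minimal PyTorch sketch of an additive margin softmax loss with a single constant margin; the function name and the default values s = 30 and λ = 0.2 are illustrative assumptions rather than values prescribed by the disclosure.

    import torch
    import torch.nn.functional as F

    def additive_margin_softmax_loss(x, weight, labels, s=30.0, margin=0.2):
        """Additive margin softmax loss in the form of Formula three.

        x:      (N, D) voiceprint features from the penultimate layer
        weight: (C, D) class weight vectors (one class center per speaker)
        labels: (N,)   class index y_n of each voiceprint feature
        """
        # Normalizing features and class centers turns the logits into cos(theta).
        cos_theta = F.normalize(x, dim=1) @ F.normalize(weight, dim=1).t()   # (N, C)

        # Subtract the same margin from the target-class cosine only.
        one_hot = F.one_hot(labels, num_classes=weight.size(0)).float()
        logits = s * (cos_theta - margin * one_hot)

        # Cross entropy over the scaled logits realizes the log term of Formula three.
        return F.cross_entropy(logits, labels)

In this form every voiceprint feature is pushed toward its class center with the same force, which is the limitation addressed below.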
The inventors have found that, for voiceprint features belonging to the same class, the smaller the distance of each voiceprint feature from the class center, the better; however, the voiceprint features of a class lie at different distances from the class center. The final goal of neural network training is to reduce the distance of the voiceprint features from the class center. Since the margin can be understood as the force pushing a voiceprint feature toward the class center, it is unreasonable for the margin to be the same for voiceprint features at different distances from the class center.
When the sample audio data set contains sample audio data from a large number of speakers, and especially when sample audio data from different scenes are mixed, differences in the amount and quality of each speaker's audio make the distributions of different speakers' sample audio data in the voiceprint feature space differ markedly. For example, for a speaker A with 200 clean utterances and a speaker B with 100 noisy utterances, the speaker with more utterances has a wider distribution region and a larger distance to the class center, so a larger margin is needed to push the samples toward the class center; likewise, for noisier audio the distance from the sample to the class center is larger, and a larger margin is also needed to push the sample toward the class center. The distribution characteristics of the data should therefore be considered when setting the margin.
The training method provided by the exemplary embodiment of the present disclosure may be performed by a deep learning processor or a chip in the deep learning processor. Fig. 3 illustrates a flowchart of a training method provided by an exemplary embodiment of the present disclosure. As shown in fig. 3, a training method provided by an exemplary embodiment of the present disclosure includes:
step 301: a sample audio data set is obtained, the sample audio data set comprising a plurality of sample audio data. The multiple sample audio data may be from different speakers or may be from the same speaker. When the plurality of sample data come from different speakers, part of the sample audio data may come from different speakers, or all of the sample audio data may come from different speakers.
In practical applications, the data contained in the sample audio data set is stored in an off-chip memory. When the deep learning processor starts to execute the training method, the sample audio data set can be fetched from the off-chip memory by the direct memory access unit and buffered in the neuron storage unit. The plurality of sample audio data contained in the sample audio data set can be labeled, either manually or by a neural network, and the labeled content may be the speaker identity of the sample audio data.
Step 302: training a voiceprint recognition model based on a sample audio data set, wherein the voiceprint recognition model is used for determining a voiceprint feature corresponding to each sample audio data in a plurality of sample audio data, aiming at different voiceprint features, a loss function of the voiceprint recognition model contains different margins, and the margins are associated with the distance from the voiceprint feature to the center of the category to which the voiceprint feature belongs. The structure of the voiceprint recognition model can adopt various relatively mature neural network structures, such as a Resnet neural network structure, an X-vector neural network structure, and the like, but is not limited thereto.
Since the margin may be considered as a force pushing a voiceprint feature towards the centre of a class, the effect of the distance of a voiceprint feature on the margin may be introduced into the loss function when the margin is associated with the distance of the voiceprint feature from the centre of the class to which the voiceprint feature belongs (referred to as the distance of the voiceprint feature). Based on this, when the voiceprint recognition model is trained based on the sample audio data set, the model parameters of the voiceprint recognition model can be updated more quickly by using the loss function, and the trained voiceprint recognition model can accurately determine the identity of the speaker.
As a possible implementation, the loss function may be a Softmax loss function, which may be an ordinary Softmax loss function or one of its variants. The loss function satisfies the form shown in Formula four.
L = -\frac{1}{N}\sum_{n=1}^{N}\log\frac{e^{s(\cos\theta_{y_n}-\lambda_n)}}{e^{s(\cos\theta_{y_n}-\lambda_n)}+\sum_{m\neq y_n}e^{s\cos\theta_{n,m}}}

Formula four
In Formula four, L is the loss function, λ_n denotes the margin for the n-th voiceprint feature, θ_yn denotes the angle between x_n and W_yn, θ_n,m denotes the angle between x_n and W_m, x_n denotes the n-th voiceprint feature, n is an integer greater than or equal to 1 and less than or equal to N, and N represents the number of sample audio data in the sample audio data set;
W_yn denotes the weight vector of the voiceprint features belonging to class y_n, the n-th voiceprint feature being one of the voiceprint features belonging to class y_n; W_m denotes the weight vector of the voiceprint features belonging to class m, m ≠ y_n; m and y_n are both integers greater than or equal to 1 and less than or equal to C, where C represents the number of classes of voiceprint features; C and N are both integers greater than 2; and s represents a hyper-parameter.
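Continuing the sketch above, Formula four only changes the margin from a constant into a per-sample value λ_n; a minimal PyTorch variant, with the margin vector passed in as an assumption about how the margins would be supplied, could look as follows.

    import torch
    import torch.nn.functional as F

    def per_sample_margin_softmax_loss(x, weight, labels, margins, s=30.0):
        """Loss in the form of Formula four: each voiceprint feature n has its own margin.

        margins: (N,) tensor holding lambda_n for each sample.
        """
        cos_theta = F.normalize(x, dim=1) @ F.normalize(weight, dim=1).t()   # (N, C)
        one_hot = F.one_hot(labels, num_classes=weight.size(0)).float()

        # Broadcasting lambda_n onto the target-class column applies a different
        # margin to every voiceprint feature.
        logits = s * (cos_theta - margins.unsqueeze(1) * one_hot)
        return F.cross_entropy(logits, labels)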
In an alternative, the margin is positively correlated with the distance of the voiceprint feature. Based on this, when the distance of a voiceprint feature is smaller, the force pushing it toward the center of its class is smaller; when the distance is larger, the force pushing it toward the class center is larger, so that voiceprint features far from their class center can approach it more quickly.
Illustratively, when N = 2 and C = 2, the sample audio data set includes two classes of sample audio data, and the number of sample audio data in each class is 1. Based on this, the loss function shown in Formula four can be expressed as Formula five:
L_1 = -\frac{1}{2}\left[\log\frac{e^{s(\cos\theta_{y_1}-\lambda_1)}}{e^{s(\cos\theta_{y_1}-\lambda_1)}+e^{s\cos\theta_{1,2}}}+\log\frac{e^{s(\cos\theta_{y_2}-\lambda_2)}}{e^{s(\cos\theta_{y_2}-\lambda_2)}+e^{s\cos\theta_{2,1}}}\right]

Formula five
In Formula five, L_1 is the loss function when N = 2 and C = 2, s is the hyper-parameter, for which a value of 30 may be chosen, λ_1 represents the margin for the 1st voiceprint feature, θ_y1 represents the angle between x_1 and W_y1, θ_1,2 is the angle between x_1 and W_2, x_1 represents the 1st voiceprint feature, W_y1 is the weight vector of the voiceprint features belonging to class 1, and W_2 represents the weight vector of class 2, to which the 1st voiceprint feature does not belong. Here, W_y1 is the class center of x_1, and cos(θ_y1) represents the distance from the 1st voiceprint feature to the center of the class to which it belongs, which is defined as the first distance.
λ_2 represents the margin for the 2nd voiceprint feature, θ_y2 represents the angle between x_2 and W_y2, θ_2,1 is the angle between x_2 and W_1, x_2 represents the 2nd voiceprint feature, W_y2 is the weight vector of the voiceprint features belonging to class 2, and W_1 represents the weight vector of class 1, to which the 2nd voiceprint feature does not belong. Here, W_y2 is the class center of x_2, and cos(θ_y2) represents the distance from the 2nd voiceprint feature to the center of the class to which it belongs, which is defined as the second distance.
When the first distance is greater than the second distance, λ_1 > λ_2. Since the quality of the 2nd voiceprint feature is higher than that of the 1st voiceprint feature in this case, setting λ_1 > λ_2 allows the margin λ_1 for the 1st voiceprint feature to ensure that the 1st voiceprint feature approaches its class center quickly (relative to the speed at which the 2nd voiceprint feature approaches its class center). In this way, the difference between the margins of the two voiceprint features balances the difference between their distances to their class centers, which increases the training speed and ensures the recognition accuracy of the trained model for voiceprint features.
As a possible implementation, for a plurality of voiceprint features belonging to the same category, the margin satisfies a first gaussian distribution. That is, when there are a plurality of voiceprint features belonging to the same category, the corresponding plurality of margins satisfy the first gaussian distribution. The margin distribution mode meets the distribution rule of the sample audio data of the same category in the feature space in the actual scene, so that the margin configured for each voiceprint feature is closer to the actual scene, and the voiceprint feature identification accuracy of the trained voiceprint model is further improved.
For the first Gaussian distribution, its mean and variance determine the position and shape of the Gaussian curve. For example, the mean μ1 of the first Gaussian distribution determines the position of the symmetry axis of the Gaussian curve, and the curve reaches its maximum where x equals that mean. Meanwhile, the variance σ1 of the first Gaussian distribution determines the height of this maximum, and the distribution density of the margins is determined by σ1.
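As a small illustration of the first Gaussian distribution, the following numpy sketch draws one margin per voiceprint feature of a class; the mean and variance values are placeholders consistent with the ranges mentioned later, not values fixed by the disclosure.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_class_margins(n_features, mu1=0.3, sigma1=0.0015):
        """Draw one margin per voiceprint feature of a class from the first
        Gaussian distribution; sigma1 is the variance, so its square root is
        passed to the sampler as the standard deviation."""
        return rng.normal(loc=mu1, scale=np.sqrt(sigma1), size=n_features)

    # e.g. a class with 3 voiceprint features, mean margin 0.3, variance 0.0015
    margins = sample_class_margins(3)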
Illustratively, when N = 6 and C = 2, the sample audio data set contains 6 sample audio data, whose voiceprint features are numbered the 1st voiceprint feature x_1, the 2nd voiceprint feature x_2, the 3rd voiceprint feature x_3, the 4th voiceprint feature x_4, the 5th voiceprint feature x_5, and the 6th voiceprint feature x_6. The sample audio data contained in the sample audio data set can be divided into two classes: the voiceprint features of the first class of sample audio data are the 1st voiceprint feature x_1, the 2nd voiceprint feature x_2, and the 3rd voiceprint feature x_3, and the voiceprint features of the second class of sample audio data are the 4th voiceprint feature x_4, the 5th voiceprint feature x_5, and the 6th voiceprint feature x_6. Based on this, the loss function shown in Formula four can be expressed as Formula six:
L_2 = -\frac{1}{6}\left[\sum_{n=1}^{3}\log\frac{e^{s(\cos\theta_{y_n}-\lambda_n)}}{e^{s(\cos\theta_{y_n}-\lambda_n)}+e^{s\cos\theta_{n,2}}}+\sum_{n=4}^{6}\log\frac{e^{s(\cos\theta_{y_n}-\lambda_n)}}{e^{s(\cos\theta_{y_n}-\lambda_n)}+e^{s\cos\theta_{n,1}}}\right]

Formula six
In Formula six, L_2 is the loss function when N = 6 and C = 2, and s is a hyper-parameter whose value may be 30.
λ_1 represents the margin for the 1st voiceprint feature, θ_y1 represents the angle between x_1 and W_y1, and θ_1,2 is the angle between x_1 and W_2; λ_2 represents the margin for the 2nd voiceprint feature, θ_y2 represents the angle between x_2 and W_y1, and θ_2,2 is the angle between x_2 and W_2; λ_3 represents the margin for the 3rd voiceprint feature, θ_y3 represents the angle between x_3 and W_y1, and θ_3,2 is the angle between x_3 and W_2. W_y1 is the class center of x_1, x_2, and x_3, i.e., the weight vector of the voiceprint features belonging to class 1, and W_2 represents the weight vector of the class to which the 1st to 3rd voiceprint features do not belong.
λ_4 represents the margin for the 4th voiceprint feature, θ_y4 represents the angle between x_4 and W_y2, and θ_4,1 is the angle between x_4 and W_1; λ_5 represents the margin for the 5th voiceprint feature, θ_y5 represents the angle between x_5 and W_y2, and θ_5,1 is the angle between x_5 and W_1; λ_6 represents the margin for the 6th voiceprint feature, θ_y6 represents the angle between x_6 and W_y2, and θ_6,1 is the angle between x_6 and W_1. W_y2 is the class center of x_4, x_5, and x_6, i.e., the weight vector of the voiceprint features belonging to class 2, and W_1 represents the weight vector of the class to which the 4th to 6th voiceprint features do not belong.
From the above, it can be seen that for the voiceprint features belonging to class 1, the loss function contains 3 margins, namely λ_1, λ_2, and λ_3, and λ_1, λ_2, and λ_3 follow a first Gaussian distribution. For the voiceprint features belonging to class 2, the loss function contains 3 margins, namely λ_4, λ_5, and λ_6, and λ_4, λ_5, and λ_6 follow a first Gaussian distribution.
In an alternative, if the number of voiceprint features belonging to each class is 1, the first Gaussian distribution is not involved. In this case, a Gaussian distribution rule is exhibited across the voiceprint features belonging to different classes: for a plurality of voiceprint features belonging to different classes, the margins satisfy a second Gaussian distribution, so the distribution rule of voiceprint features of different classes can be introduced into the loss function. When this loss function is used to update the model parameters, voiceprint feature training can be completed more quickly, and the voiceprint recognition model is guaranteed to have higher recognition accuracy.
Illustratively, the above-mentioned margin and the actual class quality of the voiceprint feature satisfy a first mapping relation, and the first mapping relation satisfies: the margin is inversely related to the actual class quality to which it is mapped. That is to say, when the actual class quality increases, the margin decreases, so the variation trend of the margin is associated with the variation trend of the actual class quality. Therefore, when the margin is introduced into the loss function, the joint variation trend of the margin and the actual class quality of the voiceprint feature is also introduced into the loss function, which completes voiceprint feature training more quickly and ensures that the voiceprint recognition model has higher recognition accuracy.
In the case where the number of voiceprint features belonging to each class is an integer greater than or equal to 2, the first Gaussian distribution applies. In this case, the means of the first Gaussian distributions satisfy the second Gaussian distribution across the voiceprint features belonging to different classes. The mean of the second Gaussian distribution can be determined according to the actual situation; for example, the mean μ2 of the second Gaussian distribution can be a constant of 0.2 to 0.4 (e.g., 0.2, 0.3, or 0.4), and the variance σ2 of the second Gaussian distribution can be a constant of 0.0005 to 0.002 (e.g., 0.0015), so as to determine the Gaussian curve of the second Gaussian distribution and obtain the distribution rule of the means of the first Gaussian distributions. On this basis, the variance σ1 of the first Gaussian distribution can be specified (for example, 0.0005 to 0.002), and the distribution rule of the margins is determined using the mean of the first Gaussian distribution, so that the purpose of introducing different margins into the loss function for different voiceprint features is achieved.
For voiceprint features belonging to the same class, the mean of the first Gaussian distribution is equal to a first reference factor. That is, for the voiceprint features of one class, there is one first reference factor and a plurality of margins that follow a first Gaussian distribution whose mean is that first reference factor. For voiceprint features belonging to different classes, the distances from the voiceprint features to their class centers are also Gaussian-distributed; therefore, from the characteristic of Gaussian-distributed data, the first reference factors follow a Gaussian distribution across the different classes. Based on this, the symmetric positions of the Gaussian curves of the first Gaussian distributions can be regulated through the Gaussian distribution of the first reference factors, so that the symmetric positions of the first Gaussian distributions match the spatial distribution rules of the different classes of voiceprint features, which further optimizes the way margins are allocated to the feature vectors.
Illustratively, when N = 12 and C = 3, the number of voiceprint features belonging to each class is 4. For the voiceprint features belonging to the 3 classes, 3 first reference factors are arranged in one-to-one correspondence, namely the 1st first reference factor, the 2nd first reference factor, and the 3rd first reference factor.
For the voiceprint features belonging to class 1, the loss function contains 4 margins, which follow a first Gaussian distribution whose mean is the 1st first reference factor. For the voiceprint features belonging to class 2, the loss function contains 4 margins, which follow a first Gaussian distribution whose mean is the 2nd first reference factor. For the voiceprint features belonging to class 3, the loss function contains 4 margins, which follow a first Gaussian distribution whose mean is the 3rd first reference factor. Meanwhile, the 1st, 2nd, and 3rd first reference factors follow a second Gaussian distribution.
To further optimize the first reference factor, it may be defined that the first reference factor satisfies the first mapping relation with the actual class quality of the voiceprint feature. The first mapping relation satisfies: the first reference factor is inversely related to the actual class quality of the first reference factor map. That is, when the actual class quality increases, the first reference factor decreases so that the trend of change of the first reference factor is associated with the trend of change of the actual class quality. And aiming at a plurality of voiceprint features belonging to the same category, the symmetrical positions of Gaussian curves of first Gaussian distribution constructed by a plurality of margins can be determined by using a first reference factor, so that when the margins are introduced into a loss function, the variation trend of the margins and the variation trend of the actual category quality of the voiceprint features are introduced into the loss function, the voiceprint feature training is completed more quickly, and the voiceprint recognition model is ensured to have higher recognition accuracy.
In practical applications, the actual class quality of a voiceprint feature is not only related to the theoretical quality of the voiceprint feature, but also related to the number proportion of the voiceprint features. Based on this, regardless of the number of voiceprint features of the same category, for voiceprint features belonging to the same category, the actual category quality satisfies: the actual class quality is positively correlated with the average class quality of the voiceprint features and the actual class quality is negatively correlated with the number of voiceprint features belonging to the same class.
For example: if the number of the voiceprint features is large, the number of the audios sent by the speaker is large, such as 200, the distribution area is wide, the distance from the voiceprint features to the center of the category is long, and a larger margin is needed to push the voiceprint features to approach the center of the category. Moreover, for the sample audio data with larger noise, the distance between the voiceprint feature and the center of the class to which the voiceprint feature belongs is larger, and a larger margin value is also needed to push the sample to be close to the class center. Based on this, when the number of the voiceprint features is large, the noise ratio is large, the actual class quality is low, and the margin for the voiceprint features is large relative to the margins for the voiceprint features of other classes.
In the case where the number of voiceprint features belonging to each class is 1, the margin is inversely related to the actual class quality to which it is mapped. Therefore, the larger the number of voiceprint features of a class and the larger its noise ratio, the larger the margin that can be mapped to it; across voiceprint features belonging to different classes, the margin distribution of the loss function is thus consistent with the actual application scenario, and the voiceprint recognition model can be trained more quickly.
In the case that the number of the voiceprint features belonging to each category is an integer greater than or equal to 2, since the first reference factor is inversely related to the actual category quality mapped by the first reference factor, when the number of the voiceprint features is large and the noise ratio is large, the first reference factor is also larger for the voiceprint features. Since the mean value of the first gaussian distribution is equal to the first reference factor for the voiceprint features belonging to the same category, and the mean value of the first gaussian distribution determines the position of the margin satisfying the first gaussian distribution rule, the larger the first reference factor is, the larger the margin for the voiceprint features of the category is. Therefore, the loss function aims at the voiceprint features belonging to different categories, the margin distribution is consistent with the actual application scene, and the voiceprint recognition model can be trained more quickly.
Illustratively, the above actual class quality satisfies the formula shown in Formula seven:

\hat{M}_i = \alpha\,\bar{M}_i + (1-\alpha)\,(1-K_i)

Formula seven
In Formula seven, \hat{M}_i denotes the actual class quality of the voiceprint features belonging to class i, \bar{M}_i denotes the average class quality of the voiceprint features belonging to class i, K_i denotes the ratio of the voiceprint features belonging to class i to the number of sample audio data in the sample audio data set, and α denotes the balance factor, 0 < α < 1.
Illustratively, the average class quality of the voiceprint features described above satisfies the formula shown in Formula eight:

\bar{M}_i = \frac{1}{n_i}\sum_{k=1}^{n_i}\bigl(1+\cos(\theta_{ik})\bigr)

Formula eight
The number of voiceprint features belonging to the same class and the ratio K_i satisfy the relation shown in Formula nine:

K_i = \frac{n_i}{N}

Formula nine
In Formulas eight and nine, N represents the number of sample audio data in the sample audio data set, C represents the number of classes of voiceprint features, n_i denotes the number of voiceprint features belonging to class i, i is an integer greater than or equal to 1 and less than or equal to C, k is an integer greater than or equal to 1 and less than or equal to n_i, C and N are integers greater than or equal to 2, and n_i is an integer greater than or equal to 1; \bar{M}_i denotes the average class quality of the voiceprint features belonging to class i, K_i denotes the ratio of the voiceprint features belonging to class i to the number of sample audio data in the sample audio data set, and θ_ik denotes the angle between the k-th voiceprint feature belonging to class i and the weight vector of the voiceprint features of class i.
As can be seen from the above, the expression for the actual class quality of the voiceprint features of class i consists of two terms: the first term expresses the quality of the voiceprint features of the class and can be regarded as the average class quality of class i, while K_i denotes the ratio of the voiceprint features of class i to the number of sample audio data in the sample audio data set. Combining them through the balance factor α ensures that the actual class quality of class i fuses the two factors of speaker audio noise and speaker count, and that the margins introduced into the loss function match the actual application scenario, so that the voiceprint recognition model is trained quickly and the trained model can recognize voiceprints accurately.
In practical applications, for voiceprint features belonging to different classes, the actual class quality of each class can be calculated based on the weight vector of the class, the voiceprint features belonging to it, their number, and so on; a group of Gaussian-distributed random values is then generated, the number of random values being equal to the number of voiceprint feature classes. Finally, the mapping between actual class qualities and random values is determined such that the assigned random value decreases as the actual class quality increases, and the random value mapped to each actual class quality is taken as the first reference factor. The correspondence between a first reference factor and a voiceprint feature class is determined by the class involved in computing the actual class quality.
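A numpy sketch of this procedure is given below. The exact combination used for the actual class quality (average class quality and K_i balanced by α) follows the reconstruction of Formula seven above and is an assumption, as are the function name and the Gaussian parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    def first_reference_factors(features_per_class, class_weights, alpha=0.5,
                                mu2=0.3, sigma2=0.0015):
        """Assign one first reference factor per class.

        features_per_class: list of (n_i, D) arrays of voiceprint features, one per class
        class_weights:      (C, D) array of class weight vectors (class centers)
        """
        N = sum(len(f) for f in features_per_class)
        qualities = []
        for feats, w in zip(features_per_class, class_weights):
            cos = feats @ w / (np.linalg.norm(feats, axis=1) * np.linalg.norm(w))
            avg_quality = np.mean(1.0 + cos)                      # Formula eight
            k_i = len(feats) / N                                  # Formula nine
            # Formula seven (assumed form): higher average quality and a smaller
            # class ratio give a higher actual class quality.
            qualities.append(alpha * avg_quality + (1 - alpha) * (1 - k_i))

        # One Gaussian random value per class (the second Gaussian distribution).
        factors = rng.normal(loc=mu2, scale=np.sqrt(sigma2), size=len(qualities))

        # Higher actual class quality is mapped to a smaller first reference factor.
        order = np.argsort(qualities)                 # classes in ascending quality
        assigned = np.empty_like(factors)
        assigned[order] = np.sort(factors)[::-1]      # factors in descending order
        return assigned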
In an alternative, each margin and the second reference factor satisfy the second mapping relation for a plurality of voiceprint features belonging to the same category. The second mapping relation satisfies: the margin is inversely related to a second reference factor of the margin map. Here the second reference factor may be used to evaluate the quality of the voiceprint feature such that a change in margin for a different second reference factor mapping may indirectly indicate a change in quality of the voiceprint feature. Based on the method, aiming at a plurality of voiceprint features belonging to the same category, the change of the margin is closely related to the quality change of the voiceprint features, and then the voiceprint recognition model training is completed quickly.
Illustratively, the second reference factor satisfies at least one of the following conditions for each voiceprint feature belonging to the same class:
the first condition is that: the second reference factor is positively correlated with the quality of the voiceprint features, and the second reference factor is negatively correlated with the sum of the qualities of the voiceprint features belonging to the same class. That is, in the case where the sum of the qualities of the voiceprint features belonging to the same category remains unchanged, the second reference factor is smaller as the quality of the voiceprint features is larger; in the case where the quality of the voiceprint features is unchanged, the second reference factor is larger as the sum of the qualities of the voiceprint features belonging to the same class is smaller.
The second condition is that: the second reference factor is used to characterize the ratio of the quality of the voiceprint features to the sum of the qualities of the voiceprint features belonging to the same class. The quality of the voiceprint feature is used to characterize the distance between the voiceprint feature and the weight vector of the class to which the voiceprint feature belongs.
The quality of the voiceprint feature can be represented by an inner product of the voiceprint feature and a weight vector of a category to which the voiceprint feature belongs, and can also be represented in other similar manners. For example: the quality of the voiceprint feature satisfies the formula ten as follows:
M_ij = 1 + cos(θ_ij)    (formula ten)
Similarly, the sum of the qualities of voiceprint features belonging to the same class satisfies the formula eleven as follows:
M_i = Σ_{k=1}^{n_i} (1 + cos(θ_ik))    (formula eleven)
In formulas ten and eleven, i denotes the number of the category to which a voiceprint feature belongs and is an integer greater than or equal to 1 and less than or equal to C, where C denotes the number of categories of voiceprint features and is an integer greater than or equal to 2. j and k each denote the number of a voiceprint feature within the category to which it belongs and are integers greater than or equal to 1 and less than or equal to n_i, where n_i denotes the number of voiceprint features belonging to category i and is an integer greater than or equal to 1.
M_i denotes the sum of the qualities of the voiceprint features belonging to category i, M_ij denotes the quality of the j-th voiceprint feature belonging to category i, θ_ij denotes the angle between the j-th voiceprint feature belonging to category i and the weight vector of the voiceprint features of category i, and θ_ik denotes the angle between the k-th voiceprint feature belonging to category i and the weight vector of the voiceprint features of category i.
When the second reference factor is used to characterize a ratio of the quality of the voiceprint features to the sum of the qualities of the voiceprint features belonging to the same class, the second reference factor may satisfy the following expression twelve:
t_ij = M_ij / M_i = (1 + cos(θ_ij)) / Σ_{k=1}^{n_i} (1 + cos(θ_ik))    (formula twelve)
where t_ij denotes the second reference factor mapped to the j-th voiceprint feature belonging to category i.
In practical application, for voiceprint features of the same class, all second reference factors of that class can be calculated based on each voiceprint feature and the weight vector of the class to which it belongs, and a set of Gaussian-distributed random values is generated, the number of random values being equal to the number of voiceprint features in the class. Finally, the mapping between second reference factor and random value is determined in such a way that the selected random value decreases as the second reference factor increases, and the random value mapped to a second reference factor is taken as the margin. The correspondence between a margin and a voiceprint feature is determined by the voiceprint feature involved in determining the second reference factor.
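A minimal sketch of this per-feature mapping for one class is given below, under the same illustrative assumptions as the earlier sketch; the function name and the Gaussian width sigma are assumptions, and the class's first reference factor is used as the mean of the Gaussian, as described above.

```python
import numpy as np

def margins_for_class(features, class_weight, first_reference_factor,
                      sigma=0.05, seed=0):
    """features: (n_i, d) array of voiceprint features of one class; class_weight: (d,) weight vector."""
    rng = np.random.default_rng(seed)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weight / np.linalg.norm(class_weight)
    quality = 1.0 + f @ w                         # M_ij = 1 + cos(theta_ij), formula ten
    t = quality / quality.sum()                   # second reference factor t_ij = M_ij / M_i, formula twelve
    randoms = rng.normal(first_reference_factor, sigma, size=len(t))  # the "first Gaussian distribution"
    margins = np.empty_like(t)
    margins[np.argsort(-t)] = np.sort(randoms)    # larger t_ij -> smaller margin
    return margins
```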
In order to clearly describe the method for determining the margins of the loss function according to the exemplary embodiment of the present disclosure, the scenario of the formula shown in formula six is taken as an example and described below.
The first stage is as follows: the first reference factor for each class is determined by the actual class quality of the voiceprint features of each class.
Based on formulas seven to nine, the actual class quality of the voiceprint features belonging to class 1 and the actual class quality of the voiceprint features belonging to class 2 are determined; below they are referred to as the class-1 actual class quality and the class-2 actual class quality. In the scenario of the formula shown in formula six, there are two classes of voiceprint features in total. For convenience of description, a 1st random number φ1 and a 2nd random number φ2 are given, and the two are simply taken as a set of random numbers that follow the second Gaussian distribution.

Assume that φ1 is larger than φ2 and that the class-1 actual class quality is larger than the class-2 actual class quality. Then, according to the mapping rule that the larger the actual class quality, the smaller the random number, the class-1 actual class quality is mapped to the 2nd random number φ2, so that the 1st first reference factor equals φ2 for the voiceprint features belonging to class 1. Similarly, the class-2 actual class quality is mapped to the 1st random number φ1, so that the 2nd first reference factor equals φ1 for the voiceprint features belonging to class 2.
The second stage: the margin of each voiceprint feature in each category is determined from the first reference factor of that category.
Based on the formula shown in formula twelve, the 1st second reference factor t_11, the 2nd second reference factor t_12 and the 3rd second reference factor t_13 of class 1, and the 1st second reference factor t_21, the 2nd second reference factor t_22 and the 3rd second reference factor t_23 of class 2 are determined.
Taking the 1st first reference factor as the mean, three third random numbers φ31, φ32 and φ33 following the first Gaussian distribution are given. The 1st second reference factor t_11, the 2nd second reference factor t_12 and the 3rd second reference factor t_13 of class 1 are arranged in ascending order to obtain the ascending sequence t_11 < t_12 < t_13, and φ31, φ32 and φ33 are arranged in descending order to obtain the descending sequence φ33 > φ32 > φ31. According to the mapping rule that the larger the second reference factor, the smaller the third random number, t_11 is mapped to φ33, t_12 is mapped to φ32, and t_13 is mapped to φ31. Based on this, the margin equals φ33 for the 1st voiceprint feature of class 1, φ32 for the 2nd voiceprint feature of class 1, and φ31 for the 3rd voiceprint feature of class 1.
Similarly, taking the 2nd first reference factor as the mean, three fourth random numbers φ41, φ42 and φ43 following the first Gaussian distribution are given. The 1st second reference factor t_21, the 2nd second reference factor t_22 and the 3rd second reference factor t_23 of class 2 are arranged in ascending order to obtain the ascending sequence t_21 < t_22 < t_23, and φ41, φ42 and φ43 are arranged in descending order to obtain the descending sequence φ43 > φ42 > φ41. According to the mapping rule that the larger the second reference factor, the smaller the fourth random number, t_21 is mapped to φ43, t_22 is mapped to φ42, and t_23 is mapped to φ41. Based on this, the margin equals φ43 for the 1st voiceprint feature of class 2, φ42 for the 2nd voiceprint feature of class 2, and φ41 for the 3rd voiceprint feature of class 2.
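The pairing logic used in both stages can be summarized by a tiny self-contained snippet; the numbers below are made up purely for illustration and do not come from the disclosure.

```python
import numpy as np

t = np.array([0.20, 0.35, 0.45])        # e.g. t_11, t_12, t_13 for class 1 (illustrative values)
phi = np.array([0.18, 0.25, 0.12])      # e.g. phi_31, phi_32, phi_33 (illustrative values)

margins = np.empty_like(t)
margins[np.argsort(-t)] = np.sort(phi)  # largest second reference factor -> smallest random value
print(margins)                          # [0.25 0.18 0.12]
```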
As can be seen from the above, the exemplary embodiments of the present disclosure assign margins to the training samples according to how the data inside the sample audio data set are distributed in the feature space, take both the quality and the quantity of the training data into account, and combine this with the Gaussian distribution characteristics to construct the voiceprint feature space in a more reasonable way, thereby improving the performance of the voiceprint recognition system.
The exemplary embodiment of the present disclosure also provides a voiceprint recognition method, which may be performed by the server shown in fig. 2 or a chip applied to the server. Fig. 4 shows a schematic flow chart of a voiceprint recognition method of an exemplary embodiment of the present disclosure. As shown in fig. 4, a voiceprint recognition method provided by an exemplary embodiment of the present disclosure includes:
step 401: and acquiring audio data to be identified. The server can collect the audio data to be identified through the user equipment, and the user equipment can perform noise reduction processing on the audio data to be identified in advance or directly upload the audio data to the server without performing noise reduction processing.
Step 402: and extracting the voiceprint characteristics of the audio data to be recognized by using a voiceprint recognition model, wherein the voiceprint recognition model is obtained by training through a training method shown in fig. 3.
In practical application, the server can call the model parameters and other data stored in the data storage system to extract the voiceprint features of the audio data to be recognized. The architecture of the voiceprint recognition model can be ResNet, x-vector, or the like.
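As an illustration only, the feature-extraction step might look like the following sketch; the torchaudio-based front end, the 80-dimensional log-mel features and the generic `model` argument are assumptions, not the concrete architecture of the disclosure.

```python
import torch
import torchaudio

def extract_voiceprint(model: torch.nn.Module, wav_path: str) -> torch.Tensor:
    """Run audio through a trained embedding network and return a normalized voiceprint feature."""
    waveform, sample_rate = torchaudio.load(wav_path)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=80)(waveform)   # (channels, 80, frames)
    features = torch.log(mel + 1e-6)                    # log-mel front end (assumed)
    with torch.no_grad():
        embedding = model(features)                     # (1, embedding_dim)
    return torch.nn.functional.normalize(embedding, dim=-1)
```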
Step 403: based on the voiceprint features, a speaker identity of the audio data to be recognized is determined.
In one example, the voiceprint features pre-stored in the data storage system may themselves be voiceprint features extracted by the voiceprint recognition model, which is not described again here. In another example, when determining the speaker identity of the audio data to be recognized, the server may call the pre-stored voiceprint features from the data storage system, compare them with the voiceprint features extracted in step 402, and determine whether the extracted voiceprint features match the pre-stored voiceprint features. Here, the voiceprint features pre-stored in the data storage system also differ according to the application scenario of the voiceprint recognition method.
For example, when the voiceprint recognition method is used to determine whether two utterances come from the same speaker, the voiceprint features of other speakers may be pre-stored. The voiceprint features extracted in step 402 are compared with the pre-stored voiceprint features of the other speakers in terms of similarity; when the similarity is greater than or equal to a threshold (e.g., 0.5), the two voiceprint features are considered to come from the same speaker, otherwise they are considered to come from different speakers. As another example, when the voiceprint recognition method is used to determine the identity of a speaker, the voiceprint features of the user may be pre-stored. When the voiceprint features extracted in step 402 match the pre-stored voiceprint features of the user, it is determined that the audio data to be recognized collected by the user equipment is audio uttered by that user.
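A minimal sketch of the similarity comparison in step 403 is given below; the cosine-similarity measure and the 0.5 threshold follow the example above, while the function name is an assumption.

```python
import numpy as np

def same_speaker(query: np.ndarray, enrolled: np.ndarray, threshold: float = 0.5) -> bool:
    """Return True if two voiceprint features are considered to come from the same speaker."""
    sim = float(np.dot(query, enrolled) /
                (np.linalg.norm(query) * np.linalg.norm(enrolled)))
    return sim >= threshold
```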
For example, the user device may have a display interface, and a client with a voiceprint recognition function, such as a voiceprint lock, may be installed on the user device. Fig. 5 shows a schematic interface diagram presented by a user device of an exemplary embodiment of the present disclosure. As shown in fig. 5, a voiceprint lock interface 502 is displayed on a display screen 501 of the user equipment 500, and the user can operate according to the prompt text on the interface, so that the user equipment collects the audio to be recognized uttered by the user.
For example, the voiceprint lock interface prompts: please press the microphone button to speak. The user may press the microphone button 503 of fig. 5 and speak the pre-saved audio. After the user equipment collects the audio, it can automatically upload the audio to the server through the communication network for voiceprint recognition. If the voiceprint recognition succeeds, the server issues an unlock instruction to the voiceprint lock, thereby achieving voiceprint unlocking. If the voiceprint recognition fails, the user may repeat the previous operation several times (e.g., 2 times). If the voiceprint recognition still fails after a preset number of attempts (e.g., 3 attempts), the server can issue a lock instruction to the voiceprint lock, and the user can no longer repeat the previous operation, thereby protecting the data inside the user equipment.
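The unlock/lock flow described above can be sketched as follows; `recognize`, `unlock` and `lock` are placeholder callables introduced for illustration, not interfaces defined by the disclosure.

```python
def voiceprint_lock_flow(recognize, unlock, lock, max_attempts: int = 3) -> None:
    """Allow up to max_attempts recognition attempts; unlock on success, lock after repeated failure."""
    for attempt in range(1, max_attempts + 1):
        if recognize():      # server-side voiceprint recognition succeeded
            unlock()         # issue unlock instruction to the voiceprint lock
            return
    lock()                   # issue lock instruction after the preset number of failures
```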
The above description mainly introduces the scheme provided by the embodiments of the present disclosure from the perspective of a server. It is understood that the server includes hardware structures and/or software modules for performing the respective functions in order to implement the above-described functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The embodiment of the present disclosure may perform division of functional units on the server according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiments of the present disclosure is illustrative, and is only one division of logic functions, and there may be another division in actual implementation.
In the case where each functional module is divided according to its corresponding function, the exemplary embodiments of the present disclosure provide a training apparatus, which may be a deep learning processor or a chip applied to the deep learning processor. Fig. 6 shows a schematic block diagram of the functional modules of a training apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the training apparatus 600 includes:
an obtaining module 601, configured to obtain a sample audio data set, where the sample audio data set includes a plurality of sample audio data;
a training module 602, configured to train a voiceprint recognition model based on the sample audio data set, where the voiceprint recognition model is configured to determine a voiceprint feature corresponding to each sample audio data in the plurality of sample audio data, and where, for different voiceprint features, the loss function of the voiceprint recognition model contains different margins, and the margins are associated with the distances from the voiceprint features to the class centers of the classes to which the voiceprint features belong.
In the case where each functional module is divided according to its corresponding function, the exemplary embodiments of the present disclosure provide a voiceprint recognition apparatus, which may be a server or a chip applied to the server. Fig. 7 shows a schematic block diagram of the functional modules of a voiceprint recognition apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 7, the voiceprint recognition apparatus 700 includes:
an obtaining module 701, configured to obtain audio data to be identified;
an extracting module 702, configured to extract a voiceprint feature of the audio data to be recognized by using a voiceprint recognition model, where the voiceprint recognition model is obtained by training according to the above exemplary training method;
a determining module 703, configured to determine a speaker identity of the audio data to be recognized based on the voiceprint feature.
Fig. 8 shows a schematic block diagram of a chip according to an exemplary embodiment of the present disclosure. As shown in fig. 8, the chip 800 includes one or more (including two) processors 801 and a communication interface 802. The communication interface 802 may support the server in performing the data transceiving steps in the training method and the voiceprint recognition method, and the processor 801 may support the server in performing the data processing steps in the training method and the voiceprint recognition method.
Optionally, as shown in fig. 8, the chip 800 further includes a memory 803, and the memory 803 may include a read-only memory and a random access memory, and provides the processor with operation instructions and data. A portion of the memory may also include non-volatile random access memory (NVRAM).
In some embodiments, as shown in fig. 8, the processor 801 executes the corresponding operation by calling an operation instruction stored in the memory (the operation instruction may be stored in the operating system). The processor 801 controls the processing operations of any of the terminal devices, and may also be referred to as a Central Processing Unit (CPU). The memory 803 may include both read-only memory and random-access memory, and provides instructions and data to the processor 801. A portion of the memory 803 may also include NVRAM. In application, the processor, the communication interface, and the memory are coupled together by a bus system, which may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 804 in fig. 8.
The method disclosed by the embodiments of the present disclosure can be applied to a processor or implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA (field-programmable gate array) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the electronic device to perform a method according to embodiments of the present disclosure, such as the training method and/or the voiceprint recognition method.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium, wherein the non-transitory computer readable storage medium stores a computer program, which when executed by a processor of a computer, is operative to cause the computer to perform a method according to an embodiment of the present disclosure, such as a training method and/or a voiceprint recognition method.
Exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to carry out a method according to an embodiment of the present disclosure, such as the training method and/or the voiceprint recognition method.
Referring to fig. 9, a structural block diagram of an electronic device 900, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic apparatus 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The calculation unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
As shown in fig. 9, a number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the electronic device 900, and the input unit 906 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. Output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. Storage unit 908 may include, but is not limited to, a magnetic disk, an optical disk. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
As shown in FIG. 9, computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 901 performs the respective methods and processes described above. For example, in some embodiments, the aforementioned training methods and voiceprint recognition methods can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. In some embodiments, the computing unit 901 may be configured to perform the training method and the voiceprint recognition method by any other suitable means (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (16)

1. A method of training, comprising:
obtaining a sample audio data set, wherein the sample audio data set comprises a plurality of sample audio data;
training a voiceprint recognition model based on the sample audio data set, wherein the voiceprint recognition model is used for determining a voiceprint feature corresponding to each sample audio data in the plurality of sample audio data, and for different voiceprint features, a loss function of the voiceprint recognition model contains different margins, the margins are associated with distances from the voiceprint features to class centers of classes to which the voiceprint features belong, and for a plurality of voiceprint features belonging to the same class, each margin and a second reference factor satisfy a second mapping relation, and the second mapping relation satisfies: the margin is inversely related to the second reference factor to which the margin is mapped, the second reference factor being used to evaluate the quality of the voiceprint feature.
2. The method of claim 1, wherein the margin is positively correlated to a distance from the voiceprint feature to a center of a class to which the voiceprint feature belongs.
3. The method of claim 1, wherein the margin satisfies a first Gaussian distribution for a plurality of voiceprint features belonging to a same class.
4. The method according to claim 3, wherein the mean of the first Gaussian distribution satisfies a second Gaussian distribution for voiceprint features belonging to different classes, wherein the mean of the first Gaussian distribution equals a first reference factor for voiceprint features belonging to the same class, and wherein the first reference factor satisfies a first mapping relation with the actual class quality of the voiceprint features;
wherein the first mapping relation satisfies: the first reference factor is inversely related to the actual class quality to which the first reference factor is mapped.
5. The method according to claim 1, wherein in the case that the number of voiceprint features belonging to each category is 1, for a plurality of voiceprint features belonging to different categories, the margin satisfies a second gaussian distribution, and the margin and the actual category quality of the voiceprint features satisfy a first mapping relation;
the first mapping relation satisfies: the margin is inversely related to the actual class quality of the margin map.
6. Method according to claim 4 or 5, characterized in that for voiceprint features belonging to the same class, the actual class quality satisfies:
the actual class quality is positively correlated with the average class quality of the voiceprint features, and the actual class quality is negatively correlated with the number of voiceprint features belonging to the same class.
7. The method of claim 6, wherein the actual class quality satisfies:
Figure 510647DEST_PATH_IMAGE001
wherein,
Figure 479652DEST_PATH_IMAGE002
denotes the actual class quality of the voiceprint features belonging to class i,
Figure 524968DEST_PATH_IMAGE003
denotes the average class quality of the voiceprint features belonging to class i, K_i denotes the ratio of the number of voiceprint features belonging to class i to the number of sample audio data in the sample audio data set, and α denotes the balance factor, with 0 < α < 1.
8. The method of claim 6, wherein the average class quality of the voiceprint features satisfies:
M̄_i = (1 / n_i) · Σ_{k=1}^{n_i} (1 + cos(θ_ik));

the number of voiceprint features belonging to the same category satisfies:

n_i = K_i · N;

wherein N denotes the number of sample audio data in the sample audio data set, n_i denotes the number of voiceprint features belonging to category i, i is an integer greater than or equal to 1 and less than or equal to C, k is an integer greater than or equal to 1 and less than or equal to n_i, C and N are integers greater than or equal to 2, and n_i is an integer greater than or equal to 1;

M̄_i denotes the average class quality of the voiceprint features belonging to category i, K_i denotes the ratio of the number of voiceprint features belonging to category i to the number of sample audio data in the sample audio data set, and θ_ik denotes the angle between the k-th voiceprint feature belonging to category i and the weight vector of the voiceprint features of category i.
9. The method according to any one of claims 1 to 5,
for each voiceprint feature belonging to the same class, the second reference factor satisfies at least one of the following conditions:
the first condition is that: the second reference factor is positively correlated with the quality of the voiceprint features, and the second reference factor is negatively correlated with the sum of the qualities of the voiceprint features belonging to the same class;
the second condition is that: the second reference factor is used for representing the ratio of the quality of the voiceprint features to the sum of the qualities of the voiceprint features belonging to the same class;
the quality of the voiceprint features is used to characterize the distance between the voiceprint features and the weight vector of the class to which the voiceprint features belong.
10. The method of claim 9, wherein the quality of the voiceprint features satisfies: M_ij = 1 + cos(θ_ij), and the sum of the qualities of the voiceprint features belonging to the same class satisfies:

M_i = Σ_{k=1}^{n_i} (1 + cos(θ_ik))

wherein i is an integer greater than or equal to 1 and less than or equal to C, and C is an integer greater than or equal to 2;

j and k are integers greater than or equal to 1 and less than or equal to n_i, n_i denotes the number of voiceprint features belonging to class i, and n_i is an integer greater than or equal to 2;

M_i denotes the sum of the qualities of the voiceprint features belonging to class i, M_ij denotes the quality of the j-th voiceprint feature belonging to class i, θ_ij denotes the angle between the j-th voiceprint feature belonging to class i and the weight vector of the voiceprint features of class i, and θ_ik denotes the angle between the k-th voiceprint feature belonging to class i and the weight vector of the voiceprint features of class i.
11. The method according to any one of claims 1 to 5, wherein the loss function satisfies:
Figure 163891DEST_PATH_IMAGE008
wherein L is the loss function, λ_n denotes the margin of the n-th voiceprint feature, θ_yn denotes the angle between x_n and W_yn, θ_n,m denotes the angle between x_n and W_m, x_n denotes the n-th voiceprint feature, n is an integer greater than or equal to 1 and less than or equal to N, and N denotes the number of sample audio data in the sample audio data set;

W_yn denotes the weight vector of the voiceprint features belonging to class y_n, the n-th voiceprint feature being one of the voiceprint features belonging to class y_n, W_m denotes the weight vector of the voiceprint features belonging to class m, m ≠ y_n, m and y_n are both integers greater than or equal to 1 and less than or equal to C, C denotes the number of classes of voiceprint features, C and N are both integers greater than 2, and s denotes a hyper-parameter.
12. A voiceprint recognition method, comprising:
acquiring audio data to be identified;
extracting the voiceprint characteristics of the audio data to be recognized by utilizing a voiceprint recognition model, wherein the voiceprint recognition model is obtained by training according to the training method of any one of claims 1 to 11;
determining a speaker identity of the audio data to be recognized based on the voiceprint features.
13. A training apparatus, comprising:
an obtaining module configured to obtain a sample audio data set, wherein the sample audio data set includes a plurality of sample audio data;
a training module, configured to train a voiceprint recognition model based on the sample audio data set, wherein the voiceprint recognition model is configured to determine a voiceprint feature corresponding to each sample audio data in the plurality of sample audio data, and wherein, for different voiceprint features, a loss function of the voiceprint recognition model contains different margins, the margins are associated with distances from the voiceprint features to class centers of the classes to which the voiceprint features belong, and for a plurality of voiceprint features belonging to the same class, each margin and a second reference factor satisfy a second mapping relation, the second mapping relation satisfying: the margin is inversely related to the second reference factor to which the margin is mapped, the second reference factor being used to evaluate the quality of the voiceprint feature.
14. A voiceprint recognition apparatus comprising:
the acquisition module is used for acquiring audio data to be identified;
an extraction module, configured to extract a voiceprint feature of the audio data to be recognized by using a voiceprint recognition model, where the voiceprint recognition model is obtained by training according to the training method of any one of claims 1 to 11;
and the determining module is used for determining the speaker identity of the audio data to be recognized based on the voiceprint characteristics.
15. An electronic device, comprising:
a processor; and the number of the first and second groups,
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method of any one of claims 1 to 11 or claim 12.
16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-11 or 12.