CN110223699B - Speaker identity confirmation method, device and storage medium

Speaker identity confirmation method, device and storage medium

Info

Publication number
CN110223699B
Authority
CN
China
Prior art keywords
voice
speaker
sample
training sample
recognized
Prior art date
Legal status
Active
Application number
CN201910407670.8A
Other languages
Chinese (zh)
Other versions
CN110223699A (en)
Inventor
蔡晓东
李波
黄玳
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN201910407670.8A
Publication of CN110223699A
Application granted
Publication of CN110223699B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces

Abstract

The invention discloses a speaker identity confirmation method, device and storage medium, belonging to the technical field of voiceprint recognition. The method comprises the following steps: acquiring a trained speaker confirmation neural network; and inputting the voice of a speaker to be recognized, together with a speaker voice database, into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to that voice. The device comprises an acquisition module and a recognition module. The invention enables speaker verification to be carried out with a trained speaker verification neural network, thereby improving the accuracy and stability of speaker identity recognition.

Description

Speaker identity confirmation method, device and storage medium
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a method and a device for confirming the identity of a speaker and a storage medium.
Background
Voiceprint recognition, also known as speaker recognition, includes speaker verification techniques and speaker identification techniques. Speaker identification refers to determining the identity of a speaker from known audio and voice information. At present, speaker verification technology is maturing steadily and is widely applied in practice, for example in criminal investigation by public security organs, personnel management in urban communities, and voiceprint-based attendance in office areas.
However, although existing voiceprint recognition technology usually achieves a high recognition rate in experimental environments, it is greatly limited by complex environmental factors in practical applications, and therefore falls far short of the expected effect in real-world work.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a speaker identity confirmation method, device and storage medium, so that the speaker confirmation neural network can recognize speaker identity more accurately.
In order to solve the above technical problem, the present invention provides a speaker identity verification method, which comprises the following steps:
acquiring a trained speaker confirmation neural network;
inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized; wherein the speaker voice database includes a plurality of different voices of a plurality of different speakers.
The invention has the beneficial effects that: the trained speaker validation neural network is acquired, the voice of the speaker to be identified and the speaker voice database are input into it, and the identity of the speaker corresponding to that voice is identified. Speaker confirmation can thus be carried out based on the trained speaker confirmation neural network, improving the accuracy and stability of speaker identity recognition.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the method also comprises a step of training the speaker confirmation neural network in advance, which specifically comprises the following steps:
constructing a speaker confirmation neural network for extracting speaker characteristic representation;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix according to the training sample;
and training the speaker confirmation neural network based on the training sample and the extended similarity matrix to obtain the trained speaker confirmation neural network.
The beneficial effect of adopting the further scheme is that: a speaker confirmation neural network is first constructed, training samples are then selected, and an extended similarity matrix is determined from the training samples. Training the speaker confirmation neural network on the training samples and the extended similarity matrix yields the trained network, so that speaker confirmation can be carried out on its basis, thereby improving the accuracy and stability of speaker identity recognition.
Further, the constructing of the speaker verification neural network specifically includes:
obtaining a voice sample;
extracting acoustic features of the voice sample;
and inputting the acoustic features into an LSTM network to learn the speaker feature representation of the voice sample, and obtaining the speaker confirmation neural network.
The beneficial effect of adopting the further scheme is that: the acoustic features of the voice sample are extracted and sent to an LSTM network to learn the speaker feature representation of the voice sample, so that a simple speaker verification neural network capable of extracting speaker feature representations is obtained.
Further, the training samples include a speech sample to be recognized, a positive training sample, a negative training sample for comparison, and an auxiliary training sample for supplementing the number of the positive training samples, and the selecting the training samples based on different speech subsets of different speakers specifically includes:
selecting N different speakers, wherein the different speakers comprise a target speaker and N-1 comparison speakers, each speaker in the different speakers selects N-1 voice subsets, and each voice subset comprises M sentences of voices;
selecting a voice subset from the voice subsets of the target speaker as a target voice subset, and selecting a sentence of voice from the target voice subset as the voice sample to be recognized; and using other voices in the target voice subset as the positive training samples;
using the other voice subsets except the target voice subset in the voice subsets of the target speaker as the auxiliary training samples;
and selecting one voice subset from the voice subsets of the comparison speakers as the negative training sample.
The beneficial effect of adopting the further scheme is that: by adding auxiliary training samples to the selection of training samples, the original one-to-one or one-to-many pairing of positive and negative training samples becomes many-to-many, i.e. the number of positive plus auxiliary training samples equals the number of negative training samples. The training samples for the speaker confirmation neural network are therefore more balanced, which improves the accuracy and speed of recognition by the trained network.
Further, the determining an extended similarity matrix according to the training sample specifically includes:
obtaining a positive training sample center, a negative training sample center and an auxiliary training sample center according to the positive training sample, the negative training sample and the auxiliary training sample;
obtaining a distance value between the voice sample to be recognized and the positive training sample center according to the voice sample to be recognized and the positive training sample center, and constructing a vector matrix based on the distance value between the voice sample to be recognized and the positive training sample center;
acquiring a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
acquiring a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
The beneficial effect of adopting the further scheme is that: by constructing the extended similarity matrix, the auxiliary training sample can be integrated into the training of the speaker confirmation neural network so as to train and obtain a speaker confirmation neural network model with greatly improved recognition accuracy and speed.
Further, the method further comprises:
and constructing a loss function, and carrying out optimization convergence on the speaker confirmation neural network based on the loss function.
Further, the expression of the loss function is:
$$L(e_{i,o}) = 1 - \sigma\Big(\min\big(S_{i,oi,\text{pos}},\ \min_{1 \le k \le N-2} S_{i,ok,\text{ass}}\big)\Big) + \alpha \cdot \max_{1 \le j \le N-1} \sigma\big(S_{i,oj,\text{neg}}\big)$$

wherein $e_{i,o}$ represents the ith speech sample in the speech subset of the target speaker o among the speech samples to be recognized; N represents the number of different speakers; k indexes the kth auxiliary speech subset; j indexes the jth speech subset in the negative training sample; $\sigma$ denotes the sigmoid function; $S_{i,ok,\text{ass}}$ represents the distance value between the speech sample to be recognized and the auxiliary training sample center of the kth auxiliary speech subset; $S_{i,oi,\text{pos}}$ represents the distance value between the speech sample to be recognized and the positive training sample center; $S_{i,oj,\text{neg}}$ represents the distance value between the speech sample to be recognized and the negative training sample center of the jth speech subset; and $\alpha$ is a regulatory factor.
The beneficial effect of adopting the further scheme is that: the loss function selects the optimal sample center distance, so that the speaker confirmation neural network can converge quickly, reducing the computation required by the network.
In order to solve the above technical problem, an embodiment of the present invention further provides a device for confirming the identity of a speaker, including:
the acquisition module is used for acquiring the trained speaker confirmation neural network;
and the recognition module is used for inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized.
To solve the above technical problem, an embodiment of the present invention further provides a computer-readable storage medium, including instructions, which when executed on a computer, cause the computer to execute the speaker identity verification method according to any one of the above embodiments.
In order to solve the above technical problem, an embodiment of the present invention further provides a speaker identity verification apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the speaker identity verification method according to any one of the above embodiments when executing the program.
Drawings
Fig. 1 is a schematic flowchart of a method for validating the identity of a speaker according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a negative training sample center, an auxiliary training sample center, and a positive training sample center according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a construction of an extended similarity matrix according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a mapping relationship between a positive training sample, a negative training sample, and an auxiliary training sample according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speaker identity verification apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a method for confirming the identity of a speaker according to an embodiment of the present invention, as shown in fig. 1, in this embodiment, a method for confirming the identity of a speaker includes the following steps:
acquiring a trained speaker confirmation neural network;
inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized; wherein the speaker voice database includes a plurality of different voices of a plurality of different speakers.
The trained speaker validation neural network is first obtained and then used for speaker validation. The voice of the speaker to be identified and the speaker voice database are input into the trained speaker confirmation neural network, and the identity of the speaker corresponding to that voice is identified.
It should be noted that the speaker voice database is a voice database composed of a plurality of different voices of a plurality of different speakers, and the recognition process is to match the voice of the speaker to be recognized with the voice sample in the speaker voice database so as to confirm the identity of the speaker corresponding to the voice of the speaker to be recognized.
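For illustration, the matching step described above can be sketched in Python; the database layout (a speaker id mapped to an enrolled embedding) and the 0.7 decision threshold are assumptions for the sketch, as the patent does not fix a concrete decision rule:

```python
import numpy as np

def identify_speaker(query_embedding, enrolled, threshold=0.7):
    """Match a query utterance embedding against enrolled speaker embeddings.

    `enrolled` maps speaker id -> enrolled embedding (e.g. the mean embedding
    of that speaker's database utterances); the threshold is illustrative.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = {spk: cosine(query_embedding, emb) for spk, emb in enrolled.items()}
    best, score = max(scores.items(), key=lambda kv: kv[1])
    # Accept the best-matching speaker only if the similarity clears the threshold.
    return (best, score) if score >= threshold else (None, score)
```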
In the above embodiment, the identity of the speaker corresponding to the voice of the speaker to be recognized is recognized by acquiring the trained speaker validation neural network and inputting the voice of the speaker to be recognized and the speaker voice database to the trained speaker validation neural network. The speaker confirmation can be carried out based on the trained speaker confirmation neural network, so that the accuracy and the stability of speaker identity recognition of the speaker confirmation neural network are improved.
Optionally, the method further includes a step of training the speaker validation neural network in advance, specifically including:
constructing a speaker confirmation neural network for extracting speaker characteristic representation;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix according to the training sample;
and training the speaker confirmation neural network based on the training sample and the extended similarity matrix to obtain the trained speaker confirmation neural network.
In the above embodiment, a speaker confirmation neural network is constructed, training samples are selected, an extended similarity matrix is determined from the training samples, and the speaker confirmation neural network is trained on the training samples and the extended similarity matrix to obtain the trained network. Speaker confirmation can then be carried out with the trained network, which improves the accuracy and stability of speaker identity recognition.
Specifically, the constructing of the speaker verification neural network specifically includes:
obtaining a voice sample;
extracting acoustic features of the voice sample;
and inputting the acoustic features into an LSTM network to learn the speaker feature representation of the voice sample, and obtaining the speaker confirmation neural network.
It is worth mentioning that obtaining the voice sample further comprises preprocessing the voice sample. Each voice sample is divided into frames with a window of 25 ms per frame and a frame shift of 10 ms, and the first 180 frames and the last 180 frames of each voice sample are taken as input data. The filter-bank acoustic features of each frame are then extracted, and the acoustic features of the voice sample are fed into a 3-layer LSTM (Long Short-Term Memory) network, a type of recurrent neural network (RNN), to learn a speaker feature representation of the voice sample.
In the above embodiment, the acoustic features of the voice sample are extracted; the acoustic features are sent to an LSTM network for learning the acoustic features of the speaker in the speech sample, and a simple speaker verification neural network can be obtained.
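A minimal sketch of this front end is given below, assuming PyTorch and torchaudio; the hidden size, embedding dimension and number of filter banks are illustrative choices, since the description fixes only the 25 ms / 10 ms framing, the 180-frame crops, filter-bank features and a 3-layer LSTM:

```python
import torch
import torch.nn as nn
import torchaudio

class SpeakerEncoder(nn.Module):
    """3-layer LSTM mapping filter-bank frames to an L2-normalised speaker embedding."""
    def __init__(self, n_mels=40, hidden=768, emb_dim=256):  # assumed hyperparameters
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, fbank):                      # fbank: (batch, frames, n_mels)
        out, _ = self.lstm(fbank)
        emb = self.proj(out[:, -1])                # hidden state at the last frame
        return emb / emb.norm(dim=1, keepdim=True)

def prepare_input(wave, sample_rate=16000, n_mels=40):
    # 25 ms frames with a 10 ms shift, as described above.
    fbank = torchaudio.compliance.kaldi.fbank(
        wave, num_mel_bins=n_mels, frame_length=25.0, frame_shift=10.0,
        sample_frequency=sample_rate)              # (frames, n_mels)
    # First 180 and last 180 frames serve as input data (assumes >= 180 frames).
    return torch.stack([fbank[:180], fbank[-180:]])
```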
Specifically, the training samples include a speech sample to be recognized, a positive training sample, a negative training sample for comparison, and an auxiliary training sample for supplementing the number of the positive training samples, and the selecting the training samples based on different speech subsets of different speakers specifically includes:
selecting N different speakers, wherein the different speakers comprise a target speaker and N-1 comparison speakers, each speaker in the different speakers selects N-1 voice subsets, and each voice subset comprises M sentences of voices;
selecting a voice subset from the voice subsets of the target speaker as a target voice subset, and selecting a sentence of voice from the target voice subset as the voice sample to be recognized; and using other voices in the target voice subset as the positive training samples;
using the other voice subsets except the target voice subset in the voice subsets of the target speaker as the auxiliary training samples;
and selecting one voice subset from the voice subsets of the comparison speakers as the negative training sample.
Because the conventional sample selection method for training a speaker verification neural network is one-to-one or one-to-many, that is, one positive training sample is paired with one negative training sample or with a plurality of negative training samples, the imbalance of the training samples reduces the recognition accuracy of the neural network. Auxiliary training samples are therefore introduced into the selection of training samples to supplement the number of positive training samples, turning the original one-to-one or one-to-many pairing into many-to-many; for example, when three auxiliary training samples are introduced, one positive training sample plus three auxiliary training samples are paired against four negative training samples.
It should be noted that the sampling method of the training samples is as follows: n different speakers are randomly selected, including a target speaker and (N-1) contrasted speakers, wherein each speaker randomly selects (N-1) speech subsets, and each speech subset contains M sentences of speech. Randomly selecting a sentence of voice in a voice subset of a target speaker as a voice sample to be recognized, wherein other voices in the voice subset are used as positive training samples, and the rest (N-2) voice subsets are used as auxiliary voice subsets and are called auxiliary training samples. In addition, the auxiliary training samples are used to supplement the number of positive training samples, such that the number of positive training samples plus the auxiliary training samples equals the number of negative training samples.
In the above embodiment, the auxiliary training samples are added in the selection of the training samples, so that the original one-to-one or one-to-many training mode is changed into many-to-many, and the training samples are more balanced, thereby improving the accuracy and the speed of the trained speaker in confirming the neural network recognition.
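The sampling procedure above can be sketched as follows; the `corpus` structure (a mapping from speaker id to a list of utterances) and the assumption that each speaker has at least (N-1)*M utterances are illustrative:

```python
import random

def sample_batch(corpus, N, M):
    """Select a query, positive, auxiliary and negative training samples."""
    speakers = random.sample(list(corpus), N)
    target, contrast = speakers[0], speakers[1:]

    def subsets(spk):
        # Partition a random draw of (N-1)*M utterances into N-1 subsets of M.
        utts = random.sample(corpus[spk], (N - 1) * M)
        return [utts[i * M:(i + 1) * M] for i in range(N - 1)]

    target_subsets = subsets(target)
    query = target_subsets[0][0]        # speech sample to be recognized
    positives = target_subsets[0][1:]   # rest of the target subset
    auxiliaries = target_subsets[1:]    # N-2 auxiliary subsets
    # One subset per contrasted speaker forms the negative training sample.
    negatives = [random.choice(subsets(spk)) for spk in contrast]
    return query, positives, auxiliaries, negatives
```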
Specifically, the determining an extended similarity matrix according to the training sample specifically includes:
obtaining a positive training sample center, a negative training sample center and an auxiliary training sample center according to the positive training sample, the negative training sample and the auxiliary training sample;
obtaining a distance value between the voice sample to be recognized and the positive training sample center according to the voice sample to be recognized and the positive training sample center, and constructing a vector matrix based on the distance value between the voice sample to be recognized and the positive training sample center;
acquiring a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
acquiring a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
As shown in fig. 2, sample centers of the positive training sample, the negative training sample, and the auxiliary training sample are calculated respectively to obtain a positive training sample center, a negative training sample center, and an auxiliary training sample center; wherein the function of calculating the center of the negative training sample is as follows:
$$c_{j,o,\text{neg}} = \frac{1}{M}\sum_{m=1}^{M} e_{jm}$$

wherein $c_{j,o,\text{neg}}$ represents a negative training sample center, o represents the target speaker; j represents the speech subset of the jth contrasted speaker in the negative training sample; $e_{jm}$ represents the mth sentence of speech in the jth speech subset of the negative training sample; and M denotes that the speech subset of each speaker in the negative training sample contains M sentences of speech.
Wherein the function for calculating the center of the auxiliary training sample is as follows:
$$c_{k,o,\text{ass}} = \frac{1}{M}\sum_{m=1}^{M} e_{km}$$

wherein $c_{k,o,\text{ass}}$ represents an auxiliary training sample center, o represents the target speaker; k represents the kth auxiliary speech subset in the auxiliary training sample; $e_{km}$ represents the mth sentence of speech in the kth auxiliary speech subset; and M denotes that each speech subset in the auxiliary training sample contains M sentences of speech.
Wherein the function of calculating the center of the positive training sample is as follows:
$$c_{o}^{(-i)} = \frac{1}{M-1}\sum_{\substack{m=1 \\ m \neq i}}^{M} e_{m}$$

wherein $c_{o}^{(-i)}$ represents the positive training sample center; o represents the target speaker and (-i) represents exclusion of the ith sentence of speech; $e_{m}$ represents the mth sentence of speech in the speech subset of the speaker to be recognized; and M denotes that the speech subset of the positive training samples contains M sentences of speech.
It should be noted that the formula for calculating the distance value between the voice sample to be recognized and the positive training sample center is:

$$S_{i,oi,\text{pos}} = w \cdot \cos\big(e_{i,o},\, c_{o}^{(-i)}\big) + b$$

wherein $S_{i,oi,\text{pos}}$ represents the distance between the voice sample to be recognized and the positive training sample center, i ∈ (1, M); w and b are learnable parameters.
The calculation formula for calculating the distance value between the voice sample to be recognized and the center of the negative training sample is as follows:
$$S_{i,oj,\text{neg}} = w \cdot \cos\big(e_{i,o},\, c_{j,o,\text{neg}}\big) + b$$

wherein $S_{i,oj,\text{neg}}$ represents the distance between the voice sample to be recognized and the jth negative training sample center, i ∈ (1, M), j ∈ (1, N-1); w and b are learnable parameters.
The formula for calculating the distance value between the voice sample to be recognized and the center of the auxiliary training sample is as follows:
$$S_{i,ok,\text{ass}} = w \cdot \cos\big(e_{i,o},\, c_{k,o,\text{ass}}\big) + b$$

wherein $S_{i,ok,\text{ass}}$ represents the distance between the voice sample to be recognized and the kth auxiliary training sample center, i ∈ (1, M), k ∈ (1, N-2); w and b are learnable parameters.
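Assuming the utterance embeddings are stacked into tensors as noted in the comments, the centers and the three kinds of distance values above can be computed as in the following sketch:

```python
import torch
import torch.nn.functional as F

def centers_and_distances(e_query, positives, auxiliaries, negatives, w, b):
    """Compute the three sample centers and the scaled-cosine distance values.

    Assumed layout: `positives` is (M-1, D) (the target subset with the query
    utterance already excluded), `auxiliaries` is (N-2, M, D) and `negatives`
    is (N-1, M, D); `w` and `b` are the learnable parameters from the formulas.
    """
    c_pos = positives.mean(dim=0)    # c_o^{(-i)}, averaged over M-1 embeddings
    c_ass = auxiliaries.mean(dim=1)  # (N-2, D) auxiliary centers
    c_neg = negatives.mean(dim=1)    # (N-1, D) negative centers

    s_pos = w * F.cosine_similarity(e_query, c_pos, dim=0) + b               # scalar
    s_ass = w * F.cosine_similarity(e_query.unsqueeze(0), c_ass, dim=1) + b  # (N-2,)
    s_neg = w * F.cosine_similarity(e_query.unsqueeze(0), c_neg, dim=1) + b  # (N-1,)
    return s_pos, s_ass, s_neg
```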
Fig. 3 is a schematic diagram illustrating the construction of an extended similarity matrix according to an embodiment of the present invention. As shown in fig. 3, first, the distance value between the voice sample to be recognized and the positive training sample center is calculated, and a vector matrix is constructed based on this distance value;
then, calculating a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
then, combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
then, calculating a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and finally, obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
Namely, the distance value between the voice sample to be recognized and the center of the positive training sample is constructed into a vector matrix, the distance value between the voice sample to be recognized and the center of the negative training sample is constructed into a negative training sample similarity matrix, and then the vector matrix and the negative training sample similarity matrix are combined into a positive and negative similarity matrix.
In the invention, auxiliary training samples are introduced into the selection of training samples, so they are also blended into the similarity matrix when it is calculated. First, the distance values between the voice sample to be recognized and the auxiliary training sample centers are calculated; the distance values for all voice samples to be recognized are then combined into an auxiliary similarity matrix. The positive and negative similarity matrix and the auxiliary similarity matrix are then combined into a new similarity matrix, namely the extended similarity matrix. With the extended similarity matrix, the auxiliary training samples can be integrated into the training of the speaker validation neural network, as in the sketch below.
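A sketch of assembling the extended similarity values for one voice sample to be recognized, reusing the distance values from the previous sketch:

```python
import torch

def extended_similarity_row(s_pos, s_ass, s_neg):
    """Assemble the extended similarity values for one speech sample.

    The positive distance and the negative distances are concatenated into the
    positive-negative similarity part, and the auxiliary distances are appended
    to obtain the extended row; stacking the rows of all query utterances
    yields the extended similarity matrix.
    """
    pos_neg = torch.cat([s_pos.view(1), s_neg])  # positive-negative similarities
    return torch.cat([pos_neg, s_ass])           # extended similarity row
```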
Fig. 4 is a schematic diagram of a mapping relationship between positive training samples, negative training samples, and auxiliary training samples provided in an embodiment of the present invention, and as shown in fig. 4, the number of the positive training samples combined with the auxiliary training samples is equal to the number of the negative training samples by introducing the auxiliary training samples.
It should be noted that the auxiliary training samples are introduced because, in the conventional training of a speaker verification neural network, positive and negative training samples are selected one-to-one or one-to-many. This imbalance makes the training samples unrepresentative and limits the effectiveness of the neural network. The present invention therefore introduces new auxiliary training samples to balance the training samples, so as to improve the accuracy and speed of recognition by the speaker verification neural network.
Optionally, the method further comprises:
and constructing a loss function, and carrying out optimization convergence on the speaker confirmation neural network based on the loss function.
Specifically, the expression of the loss function is:
$$L(e_{i,o}) = 1 - \sigma\Big(\min\big(S_{i,oi,\text{pos}},\ \min_{1 \le k \le N-2} S_{i,ok,\text{ass}}\big)\Big) + \alpha \cdot \max_{1 \le j \le N-1} \sigma\big(S_{i,oj,\text{neg}}\big)$$

wherein $e_{i,o}$ represents the ith speech sample in the speech subset of the target speaker o among the speech samples to be recognized; N represents the number of different speakers; k indexes the kth auxiliary speech subset; j indexes the jth speech subset in the negative training sample; $\sigma$ denotes the sigmoid function; $S_{i,ok,\text{ass}}$ represents the distance value between the speech sample to be recognized and the auxiliary training sample center of the kth auxiliary speech subset; $S_{i,oi,\text{pos}}$ represents the distance value between the speech sample to be recognized and the positive training sample center; $S_{i,oj,\text{neg}}$ represents the distance value between the speech sample to be recognized and the negative training sample center of the jth speech subset; and $\alpha$ is a regulatory factor.
Because auxiliary training samples are introduced into the training samples, a new loss function needs to be designed. Through the loss function, the minimum distance value between the voice sample to be recognized and the auxiliary training sample centers or the positive training sample center, i.e. the center point of the target speaker farthest from the voice sample to be recognized, is selected; likewise, the maximum distance value to the negative training sample centers, i.e. the center point of the contrasted speaker closest to the voice sample to be recognized, participates in the calculation of the loss function. By selecting the optimal sample-center distances, the loss function enables the network to converge quickly and recognize speakers more accurately.
Specifically, the overall ATS-GE2E loss value is $L(e_{oi})$, with the specific mathematical expression:

$$L = \sum_{o=1}^{N}\sum_{i=1}^{M} L(e_{i,o})$$

wherein o ∈ (1, N) and i ∈ (1, M).
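A sketch of this loss for one speech sample, reconstructed from the description above (the exact patented form may differ in detail):

```python
import torch

def ats_ge2e_loss(s_pos, s_ass, s_neg, alpha=1.0):
    """Loss for one speech sample to be recognized.

    The smallest similarity among the positive and auxiliary centers (the
    farthest own-speaker center) and the largest negative similarity (the
    closest contrasted-speaker center) drive the gradient; `alpha` is the
    regulatory factor. Summing over all o and i gives the overall loss.
    """
    hardest_pos = torch.min(torch.cat([s_pos.view(1), s_ass]))
    hardest_neg = torch.max(s_neg)
    return 1.0 - torch.sigmoid(hardest_pos) + alpha * torch.sigmoid(hardest_neg)
```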
Meanwhile, as shown in fig. 5, an embodiment of the present invention further provides a speaker identity verification apparatus, including:
the acquisition module is used for acquiring the trained speaker confirmation neural network;
and the recognition module is used for inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized.
Meanwhile, an embodiment of the present invention further provides a computer-readable storage medium, which includes instructions, and when the instructions are executed on a computer, the instructions cause the computer to execute the speaker identity verification method according to any one of the above embodiments.
Meanwhile, the embodiment of the invention also provides a speaker identity confirmation device, which comprises a memory, a processor and a computer program which is stored on the memory and can be run on the processor, wherein when the processor executes the program, the speaker identity confirmation method in any one of the above embodiments is realized.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A method for speaker identity verification, comprising the steps of:
acquiring a trained speaker confirmation neural network;
inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized; wherein the speaker voice database comprises a plurality of different voices of a plurality of different speakers;
the method also comprises the step of training the speaker to confirm the neural network in advance, and specifically comprises the following steps:
constructing a speaker confirmation neural network for extracting speaker characteristic representation;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix according to the training sample;
training the speaker confirmation neural network based on the training sample and the extended similarity matrix to obtain the trained speaker confirmation neural network;
the method includes the steps of selecting training samples based on different speech subsets of different speakers, wherein the training samples include a speech sample to be recognized, a positive training sample, a negative training sample for comparison, and an auxiliary training sample for supplementing the number of the positive training samples, and specifically includes:
selecting N different speakers, wherein the different speakers comprise a target speaker and N-1 comparison speakers, each speaker in the different speakers selects N-1 voice subsets, and each voice subset comprises M sentences of voices;
selecting a voice subset from the voice subsets of the target speaker as a target voice subset, and selecting a sentence of voice from the target voice subset as the voice sample to be recognized; and using other voices in the target voice subset as the positive training samples;
using the other voice subsets except the target voice subset in the voice subsets of the target speaker as the auxiliary training samples;
selecting a voice subset from the voice subsets of the contrasted speakers as the negative training sample;
wherein, the determining an extended similarity matrix according to the training samples specifically includes:
obtaining a positive training sample center, a negative training sample center and an auxiliary training sample center according to the positive training sample, the negative training sample and the auxiliary training sample;
obtaining a distance value between the voice sample to be recognized and the positive training sample center according to the voice sample to be recognized and the positive training sample center, and constructing a vector matrix based on the distance value between the voice sample to be recognized and the positive training sample center;
acquiring a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
acquiring a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
2. The method for confirming the identity of a speaker according to claim 1, wherein the constructing of the speaker confirmation neural network specifically comprises:
obtaining a voice sample;
extracting acoustic features of the voice sample;
and inputting the acoustic features into an LSTM network to learn the speaker feature representation of the voice sample, and obtaining the speaker confirmation neural network.
3. The speaker identity verification method according to claim 1 or 2, wherein the method further comprises:
and constructing a loss function, and carrying out optimization convergence on the speaker confirmation neural network based on the loss function.
4. The speaker identity verification method of claim 3, wherein the loss function is expressed as:
$$L(e_{i,o}) = 1 - \sigma\Big(\min\big(S_{i,oi,\text{pos}},\ \min_{1 \le k \le N-2} S_{i,ok,\text{ass}}\big)\Big) + \alpha \cdot \max_{1 \le j \le N-1} \sigma\big(S_{i,oj,\text{neg}}\big)$$

wherein $e_{i,o}$ represents the ith speech sample in the speech subset of the target speaker o among the speech samples to be recognized; N represents the number of different speakers; k indexes the kth auxiliary speech subset; j indexes the jth speech subset in the negative training sample; $\sigma$ denotes the sigmoid function; $S_{i,ok,\text{ass}}$ represents the distance value between the speech sample to be recognized and the auxiliary training sample center of the kth auxiliary speech subset; $S_{i,oi,\text{pos}}$ represents the distance value between the speech sample to be recognized and the positive training sample center; $S_{i,oj,\text{neg}}$ represents the distance value between the speech sample to be recognized and the negative training sample center of the jth speech subset; and $\alpha$ is a regulatory factor.
5. A speaker identity verification apparatus, comprising:
the acquisition module is used for acquiring the trained speaker confirmation neural network;
the recognition module is used for inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized;
the acquisition module is further used for training the speaker to confirm the neural network in advance, and specifically comprises:
constructing a speaker confirmation neural network for extracting speaker characteristic representation;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix according to the training sample;
training the speaker confirmation neural network based on the training sample and the extended similarity matrix to obtain the trained speaker confirmation neural network;
the method includes the steps of selecting training samples based on different speech subsets of different speakers, wherein the training samples include a speech sample to be recognized, a positive training sample, a negative training sample for comparison, and an auxiliary training sample for supplementing the number of the positive training samples, and specifically includes:
selecting N different speakers, wherein the different speakers comprise a target speaker and N-1 comparison speakers, each speaker in the different speakers selects N-1 voice subsets, and each voice subset comprises M sentences of voices;
selecting a voice subset from the voice subsets of the target speaker as a target voice subset, and selecting a sentence of voice from the target voice subset as the voice sample to be recognized; and using other voices in the target voice subset as the positive training samples;
using the other voice subsets except the target voice subset in the voice subsets of the target speaker as the auxiliary training samples;
selecting a voice subset from the voice subsets of the contrasted speakers as the negative training sample;
wherein, the determining an extended similarity matrix according to the training samples specifically includes:
obtaining a positive training sample center, a negative training sample center and an auxiliary training sample center according to the positive training sample, the negative training sample and the auxiliary training sample;
obtaining a distance value between the voice sample to be recognized and the positive training sample center according to the voice sample to be recognized and the positive training sample center, and constructing a vector matrix based on the distance value between the voice sample to be recognized and the positive training sample center;
acquiring a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
acquiring a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
6. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform a method for speaker identity verification according to any of claims 1 to 4.
7. A speaker identity verification device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the speaker identity verification method according to any one of claims 1 to 4.
CN201910407670.8A 2019-05-15 2019-05-15 Speaker identity confirmation method, device and storage medium Active CN110223699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910407670.8A CN110223699B (en) 2019-05-15 2019-05-15 Speaker identity confirmation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910407670.8A CN110223699B (en) 2019-05-15 2019-05-15 Speaker identity confirmation method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110223699A CN110223699A (en) 2019-09-10
CN110223699B (en) 2021-04-13

Family

ID=67821242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910407670.8A Active CN110223699B (en) 2019-05-15 2019-05-15 Speaker identity confirmation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110223699B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712792A (en) * 2019-10-25 2021-04-27 Tcl集团股份有限公司 Dialect recognition model training method, readable storage medium and terminal device
CN111712874B (en) 2019-10-31 2023-07-14 支付宝(杭州)信息技术有限公司 Method, system, device and storage medium for determining sound characteristics
CN111429918A (en) * 2020-03-26 2020-07-17 云知声智能科技股份有限公司 Phone call fraud visiting method and system based on voiceprint recognition and intention analysis

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683680A (en) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 Speaker recognition method and device and computer equipment and computer readable media
CN107180628A (en) * 2017-05-19 2017-09-19 百度在线网络技术(北京)有限公司 Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model
US20180039888A1 (en) * 2016-08-08 2018-02-08 Interactive Intelligence Group, Inc. System and method for speaker change detection
US20180082691A1 (en) * 2016-09-19 2018-03-22 Pindrop Security, Inc. Dimensionality reduction of baum-welch statistics for speaker recognition
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
CN109166586A (en) * 2018-08-02 2019-01-08 平安科技(深圳)有限公司 A kind of method and terminal identifying speaker
CN109243467A (en) * 2018-11-14 2019-01-18 龙马智声(珠海)科技有限公司 Sound-groove model construction method, method for recognizing sound-groove and system
CN109256135A (en) * 2018-08-28 2019-01-22 桂林电子科技大学 A kind of end-to-end method for identifying speaker, device and storage medium
US20190035431A1 (en) * 2017-07-28 2019-01-31 Adobe Systems Incorporated Apparatus, systems, and methods for integrating digital media content
CN109326294A (en) * 2018-09-28 2019-02-12 杭州电子科技大学 A kind of relevant vocal print key generation method of text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"基于三元组损失与流形降维的文本无关说话人识别方法研究";刘崇鸣;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20200215;全文 *
"Centroid-based deep metric learning for speaker recognition";Jixuan Wang 等;《https://arxiv.org/abs/1902.02375》;20190206;全文 *
B. Li 等."Threshold Re-weighting Attention Mechanism for Speaker Verification".《2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC)》.2019, *

Also Published As

Publication number Publication date
CN110223699A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN108417217B (en) Speaker recognition network model training method, speaker recognition method and system
CN110223699B (en) Speaker identity confirmation method, device and storage medium
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN106098068A (en) A kind of method for recognizing sound-groove and device
Senoussaoui et al. First attempt of boltzmann machines for speaker verification
Li et al. Towards Discriminative Representation Learning for Speech Emotion Recognition.
Liu et al. An investigation on back-end for speaker recognition in multi-session enrollment
CN108986798B (en) Processing method, device and the equipment of voice data
CN113140254B (en) Meta-learning drug-target interaction prediction system and prediction method
CN110992988B (en) Speech emotion recognition method and device based on domain confrontation
CN111723870B (en) Artificial intelligence-based data set acquisition method, apparatus, device and medium
CN111564179B (en) Species biology classification method and system based on triple neural network
CN111401105B (en) Video expression recognition method, device and equipment
CN108877812B (en) Voiceprint recognition method and device and storage medium
CN106601258A (en) Speaker identification method capable of information channel compensation based on improved LSDA algorithm
CN113361636A (en) Image classification method, system, medium and electronic device
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN108830201A (en) Acquisition methods, device, computer equipment and the storage medium of sample triple
Ma et al. LID-senone extraction via deep neural networks for end-to-end language identification
CN110717027A (en) Multi-round intelligent question-answering method, system, controller and medium
CN114398611A (en) Bimodal identity authentication method, device and storage medium
CN112052686B (en) Voice learning resource pushing method for user interactive education
CN111680514B (en) Information processing and model training method, device, equipment and storage medium
Zhang et al. Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification}}
CN115222047A (en) Model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant