CN110223699B - Speaker identity confirmation method, device and storage medium

Speaker identity confirmation method, device and storage medium

Info

Publication number
CN110223699B
Authority
CN
China
Prior art keywords
voice
speaker
sample
training sample
recognized
Prior art date
Legal status
Active
Application number
CN201910407670.8A
Other languages
Chinese (zh)
Other versions
CN110223699A (en)
Inventor
蔡晓东
李波
黄玳
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN201910407670.8A
Publication of CN110223699A
Application granted
Publication of CN110223699B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces

Abstract

The invention discloses a speaker identity confirmation method, device and storage medium, belonging to the technical field of voiceprint recognition. The method comprises the following steps: acquiring a trained speaker confirmation neural network; and inputting the voice of a speaker to be recognized, together with a speaker voice database, into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to that voice. The device comprises an acquisition module and a recognition module. The invention enables speaker verification to be carried out with a trained speaker verification neural network, thereby improving the accuracy and stability of speaker identity recognition.

Description

Speaker identity confirmation method, device and storage medium
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a method and a device for confirming the identity of a speaker and a storage medium.
Background
Voiceprint recognition, also known as speaker recognition, includes speaker verification techniques and speaker identification techniques. Speaker identification refers to determining the identity of a speaker from known audio and voice information. At present, speaker verification technology is maturing steadily and is widely applied in practice, for example in criminal investigation by public security organs, personnel management in urban communities, and voiceprint-based attendance in office areas.
However, although existing voiceprint recognition technology usually achieves a high recognition rate in experimental environments, it is greatly limited by complex environmental factors in practical applications, and therefore falls far short of the expected effect in real-world work.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a speaker identity confirmation method, device and storage medium, so that the speaker confirmation neural network can recognize speaker identity more accurately.
In order to solve the above technical problem, the present invention provides a speaker identity verification method, which comprises the following steps:
acquiring a trained speaker confirmation neural network;
inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized; wherein the speaker voice database includes a plurality of different voices of a plurality of different speakers.
The invention has the beneficial effects that: the trained speaker validation neural network is acquired, the voice of the speaker to be identified and the speaker voice database are input into it, and the identity of the speaker corresponding to that voice is identified. Speaker confirmation can thus be carried out based on the trained speaker confirmation neural network, improving the accuracy and stability of speaker identity recognition.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the method also comprises a step of training the speaker confirmation neural network in advance, which specifically comprises the following steps:
constructing a speaker confirmation neural network for extracting speaker characteristic representation;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix according to the training sample;
and training the speaker confirmation neural network based on the training sample and the extended similarity matrix to obtain the trained speaker confirmation neural network.
The beneficial effect of adopting the further scheme is that: a speaker confirmation neural network is first constructed, training samples are then selected, and an extended similarity matrix is determined from the training samples. Training the speaker confirmation neural network on the training samples and the extended similarity matrix yields the trained network, so that speaker confirmation can be carried out on its basis, thereby improving the accuracy and stability of speaker identity recognition.
Further, the constructing of the speaker verification neural network specifically includes:
obtaining a voice sample;
extracting acoustic features of the voice sample;
and inputting the acoustic features into an LSTM network to learn the speaker feature representation of the voice sample, and obtaining the speaker confirmation neural network.
The beneficial effect of adopting the further scheme is that: the acoustic features of the voice sample are extracted and sent to an LSTM network to learn the speaker feature representation of the voice sample, so that a simple speaker verification neural network capable of extracting speaker feature representations is obtained.
Further, the training samples include a speech sample to be recognized, a positive training sample, a negative training sample for comparison, and an auxiliary training sample for supplementing the number of the positive training samples, and the selecting the training samples based on different speech subsets of different speakers specifically includes:
selecting N different speakers, wherein the different speakers comprise a target speaker and N-1 comparison speakers, each speaker in the different speakers selects N-1 voice subsets, and each voice subset comprises M sentences of voices;
selecting a voice subset from the voice subsets of the target speaker as a target voice subset, and selecting a sentence of voice from the target voice subset as the voice sample to be recognized; and using other voices in the target voice subset as the positive training samples;
using the other voice subsets except the target voice subset in the voice subsets of the target speaker as the auxiliary training samples;
and selecting one voice subset from the voice subsets of the comparison speakers as the negative training sample.
The beneficial effect of adopting the further scheme is that: by adding auxiliary training samples to the selection of training samples, the original one-to-one or one-to-many pairing of positive and negative training samples becomes many-to-many, i.e. the number of positive plus auxiliary training samples equals the number of negative training samples. The training samples for the speaker confirmation neural network are therefore more balanced, which improves the accuracy and speed of recognition by the trained network.
Further, the determining an extended similarity matrix according to the training sample specifically includes:
obtaining a positive training sample center, a negative training sample center and an auxiliary training sample center according to the positive training sample, the negative training sample and the auxiliary training sample;
obtaining a distance value between the voice sample to be recognized and the positive training sample center according to the voice sample to be recognized and the positive training sample center, and constructing a vector matrix based on the distance value between the voice sample to be recognized and the positive training sample center;
acquiring a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
acquiring a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
The beneficial effect of adopting the further scheme is that: by constructing the extended similarity matrix, the auxiliary training sample can be integrated into the training of the speaker confirmation neural network so as to train and obtain a speaker confirmation neural network model with greatly improved recognition accuracy and speed.
Further, the method further comprises:
and constructing a loss function, and carrying out optimization convergence on the speaker confirmation neural network based on the loss function.
Further, the expression of the loss function is:
$$L(e_{i,o}) = 1 - \sigma\Big(\min\big(S_{i,oi,\text{pos}},\ \min_{1 \le k \le N-2} S_{i,ok,\text{ass}}\big)\Big) + \alpha \cdot \max_{1 \le j \le N-1} \sigma\big(S_{i,oj,\text{neg}}\big)$$

wherein $e_{i,o}$ represents the ith speech sample in the speech subset of the target speaker o among the speech samples to be recognized; N represents the number of different speakers; k indexes the kth auxiliary speech subset; j indexes the jth speech subset in the negative training sample; $\sigma$ denotes the sigmoid function; $S_{i,ok,\text{ass}}$ represents the distance value between the speech sample to be recognized and the auxiliary training sample center of the kth auxiliary speech subset; $S_{i,oi,\text{pos}}$ represents the distance value between the speech sample to be recognized and the positive training sample center; $S_{i,oj,\text{neg}}$ represents the distance value between the speech sample to be recognized and the negative training sample center of the jth speech subset; and $\alpha$ is a regulatory factor.
The beneficial effect of adopting the further scheme is that: the loss function selects the optimal sample center distance, so that the speaker confirmation neural network can converge quickly, reducing the computation required by the network.
In order to solve the above technical problem, an embodiment of the present invention further provides a device for confirming the identity of a speaker, including:
the acquisition module is used for acquiring the trained speaker confirmation neural network;
and the recognition module is used for inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized.
To solve the above technical problem, an embodiment of the present invention further provides a computer-readable storage medium, including instructions, which when executed on a computer, cause the computer to execute the speaker identity verification method according to any one of the above embodiments.
In order to solve the above technical problem, an embodiment of the present invention further provides a speaker identity verification apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the speaker identity verification method according to any one of the above embodiments when executing the program.
Drawings
Fig. 1 is a schematic flowchart of a method for validating the identity of a speaker according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a negative training sample center, an auxiliary training sample center, and a positive training sample center according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a construction of an extended similarity matrix according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a mapping relationship between a positive training sample, a negative training sample, and an auxiliary training sample according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speaker identity verification apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a method for confirming the identity of a speaker according to an embodiment of the present invention, as shown in fig. 1, in this embodiment, a method for confirming the identity of a speaker includes the following steps:
acquiring a trained speaker confirmation neural network;
inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized; wherein the speaker voice database includes a plurality of different voices of a plurality of different speakers.
The trained speaker validation neural network is first obtained and then used for speaker validation. The voice of the speaker to be identified and the speaker voice database are input into the trained speaker confirmation neural network, and the identity of the speaker corresponding to that voice is identified.
It should be noted that the speaker voice database is a voice database composed of a plurality of different voices of a plurality of different speakers, and the recognition process is to match the voice of the speaker to be recognized with the voice sample in the speaker voice database so as to confirm the identity of the speaker corresponding to the voice of the speaker to be recognized.
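For illustration, the matching step described above can be sketched in Python; the database layout (a speaker id mapped to an enrolled embedding) and the 0.7 decision threshold are assumptions for the sketch, as the patent does not fix a concrete decision rule:

```python
import numpy as np

def identify_speaker(query_embedding, enrolled, threshold=0.7):
    """Match a query utterance embedding against enrolled speaker embeddings.

    `enrolled` maps speaker id -> enrolled embedding (e.g. the mean embedding
    of that speaker's database utterances); the threshold is illustrative.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = {spk: cosine(query_embedding, emb) for spk, emb in enrolled.items()}
    best, score = max(scores.items(), key=lambda kv: kv[1])
    # Accept the best-matching speaker only if the similarity clears the threshold.
    return (best, score) if score >= threshold else (None, score)
```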
In the above embodiment, the identity of the speaker corresponding to the voice of the speaker to be recognized is recognized by acquiring the trained speaker validation neural network and inputting the voice of the speaker to be recognized and the speaker voice database to the trained speaker validation neural network. The speaker confirmation can be carried out based on the trained speaker confirmation neural network, so that the accuracy and the stability of speaker identity recognition of the speaker confirmation neural network are improved.
Optionally, the method further includes a step of training the speaker validation neural network in advance, specifically including:
constructing a speaker confirmation neural network for extracting speaker characteristic representation;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix according to the training sample;
and training the speaker confirmation neural network based on the training sample and the extended similarity matrix to obtain the trained speaker confirmation neural network.
In the above embodiment, a speaker confirmation neural network is constructed, training samples are selected, an extended similarity matrix is determined from the training samples, and the speaker confirmation neural network is trained on the training samples and the extended similarity matrix to obtain the trained network. Speaker confirmation can then be carried out with the trained network, which improves the accuracy and stability of speaker identity recognition.
Specifically, the constructing of the speaker verification neural network specifically includes:
obtaining a voice sample;
extracting acoustic features of the voice sample;
and inputting the acoustic features into an LSTM network to learn the speaker feature representation of the voice sample, and obtaining the speaker confirmation neural network.
It is worth mentioning that obtaining the voice sample further comprises preprocessing the voice sample. Each voice sample is divided into frames with a window of 25 ms per frame and a frame shift of 10 ms, and the first 180 frames and the last 180 frames of each voice sample are taken as input data. The filter-bank acoustic features of each frame are then extracted, and the acoustic features of the voice sample are fed into a 3-layer LSTM (Long Short-Term Memory) network, a type of recurrent neural network (RNN), to learn a speaker feature representation of the voice sample.
In the above embodiment, the acoustic features of the voice sample are extracted; the acoustic features are sent to an LSTM network for learning the acoustic features of the speaker in the speech sample, and a simple speaker verification neural network can be obtained.
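A minimal sketch of this front end is given below, assuming PyTorch and torchaudio; the hidden size, embedding dimension and number of filter banks are illustrative choices, since the description fixes only the 25 ms / 10 ms framing, the 180-frame crops, filter-bank features and a 3-layer LSTM:

```python
import torch
import torch.nn as nn
import torchaudio

class SpeakerEncoder(nn.Module):
    """3-layer LSTM mapping filter-bank frames to an L2-normalised speaker embedding."""
    def __init__(self, n_mels=40, hidden=768, emb_dim=256):  # assumed hyperparameters
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, fbank):                      # fbank: (batch, frames, n_mels)
        out, _ = self.lstm(fbank)
        emb = self.proj(out[:, -1])                # hidden state at the last frame
        return emb / emb.norm(dim=1, keepdim=True)

def prepare_input(wave, sample_rate=16000, n_mels=40):
    # 25 ms frames with a 10 ms shift, as described above.
    fbank = torchaudio.compliance.kaldi.fbank(
        wave, num_mel_bins=n_mels, frame_length=25.0, frame_shift=10.0,
        sample_frequency=sample_rate)              # (frames, n_mels)
    # First 180 and last 180 frames serve as input data (assumes >= 180 frames).
    return torch.stack([fbank[:180], fbank[-180:]])
```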
Specifically, the training samples include a speech sample to be recognized, a positive training sample, a negative training sample for comparison, and an auxiliary training sample for supplementing the number of the positive training samples, and the selecting the training samples based on different speech subsets of different speakers specifically includes:
selecting N different speakers, wherein the different speakers comprise a target speaker and N-1 comparison speakers, each speaker in the different speakers selects N-1 voice subsets, and each voice subset comprises M sentences of voices;
selecting a voice subset from the voice subsets of the target speaker as a target voice subset, and selecting a sentence of voice from the target voice subset as the voice sample to be recognized; and using other voices in the target voice subset as the positive training samples;
using the other voice subsets except the target voice subset in the voice subsets of the target speaker as the auxiliary training samples;
and selecting one voice subset from the voice subsets of the comparison speakers as the negative training sample.
Because the conventional sample selection method for training a speaker verification neural network is one-to-one or one-to-many, that is, one positive training sample is paired with one negative training sample or with a plurality of negative training samples, the imbalance of the training samples reduces the recognition accuracy of the neural network. Auxiliary training samples are therefore introduced into the selection of training samples to supplement the number of positive training samples, turning the original one-to-one or one-to-many pairing into many-to-many; for example, when three auxiliary training samples are introduced, one positive training sample plus three auxiliary training samples are paired against four negative training samples.
It should be noted that the sampling method of the training samples is as follows: n different speakers are randomly selected, including a target speaker and (N-1) contrasted speakers, wherein each speaker randomly selects (N-1) speech subsets, and each speech subset contains M sentences of speech. Randomly selecting a sentence of voice in a voice subset of a target speaker as a voice sample to be recognized, wherein other voices in the voice subset are used as positive training samples, and the rest (N-2) voice subsets are used as auxiliary voice subsets and are called auxiliary training samples. In addition, the auxiliary training samples are used to supplement the number of positive training samples, such that the number of positive training samples plus the auxiliary training samples equals the number of negative training samples.
In the above embodiment, the auxiliary training samples are added in the selection of the training samples, so that the original one-to-one or one-to-many training mode is changed into many-to-many, and the training samples are more balanced, thereby improving the accuracy and the speed of the trained speaker in confirming the neural network recognition.
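The sampling procedure above can be sketched as follows; the `corpus` structure (a mapping from speaker id to a list of utterances) and the assumption that each speaker has at least (N-1)*M utterances are illustrative:

```python
import random

def sample_batch(corpus, N, M):
    """Select a query, positive, auxiliary and negative training samples."""
    speakers = random.sample(list(corpus), N)
    target, contrast = speakers[0], speakers[1:]

    def subsets(spk):
        # Partition a random draw of (N-1)*M utterances into N-1 subsets of M.
        utts = random.sample(corpus[spk], (N - 1) * M)
        return [utts[i * M:(i + 1) * M] for i in range(N - 1)]

    target_subsets = subsets(target)
    query = target_subsets[0][0]        # speech sample to be recognized
    positives = target_subsets[0][1:]   # rest of the target subset
    auxiliaries = target_subsets[1:]    # N-2 auxiliary subsets
    # One subset per contrasted speaker forms the negative training sample.
    negatives = [random.choice(subsets(spk)) for spk in contrast]
    return query, positives, auxiliaries, negatives
```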
Specifically, the determining an extended similarity matrix according to the training sample specifically includes:
obtaining a positive training sample center, a negative training sample center and an auxiliary training sample center according to the positive training sample, the negative training sample and the auxiliary training sample;
obtaining a distance value between the voice sample to be recognized and the positive training sample center according to the voice sample to be recognized and the positive training sample center, and constructing a vector matrix based on the distance value between the voice sample to be recognized and the positive training sample center;
acquiring a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
acquiring a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
As shown in fig. 2, sample centers of the positive training sample, the negative training sample, and the auxiliary training sample are calculated respectively to obtain a positive training sample center, a negative training sample center, and an auxiliary training sample center; wherein the function of calculating the center of the negative training sample is as follows:
$$c_{j,o,\text{neg}} = \frac{1}{M}\sum_{m=1}^{M} e_{jm}$$

wherein $c_{j,o,\text{neg}}$ represents a negative training sample center, o represents the target speaker; j represents the speech subset of the jth contrasted speaker in the negative training sample; $e_{jm}$ represents the mth sentence of speech in the jth speech subset of the negative training sample; and M denotes that the speech subset of each speaker in the negative training sample contains M sentences of speech.
Wherein the function for calculating the center of the auxiliary training sample is as follows:
$$c_{k,o,\text{ass}} = \frac{1}{M}\sum_{m=1}^{M} e_{km}$$

wherein $c_{k,o,\text{ass}}$ represents an auxiliary training sample center, o represents the target speaker; k represents the kth auxiliary speech subset in the auxiliary training sample; $e_{km}$ represents the mth sentence of speech in the kth auxiliary speech subset; and M denotes that each speech subset in the auxiliary training sample contains M sentences of speech.
Wherein the function of calculating the center of the positive training sample is as follows:
$$c_{o}^{(-i)} = \frac{1}{M-1}\sum_{\substack{m=1 \\ m \neq i}}^{M} e_{m}$$

wherein $c_{o}^{(-i)}$ represents the positive training sample center; o represents the target speaker and (-i) represents exclusion of the ith sentence of speech; $e_{m}$ represents the mth sentence of speech in the speech subset of the speaker to be recognized; and M denotes that the speech subset of the positive training samples contains M sentences of speech.
It should be noted that the formula for calculating the distance value between the voice sample to be recognized and the positive training sample center is:

$$S_{i,oi,\text{pos}} = w \cdot \cos\big(e_{i,o},\, c_{o}^{(-i)}\big) + b$$

wherein $S_{i,oi,\text{pos}}$ represents the distance between the voice sample to be recognized and the positive training sample center, i ∈ (1, M); w and b are learnable parameters.
The calculation formula for calculating the distance value between the voice sample to be recognized and the center of the negative training sample is as follows:
$$S_{i,oj,\text{neg}} = w \cdot \cos\big(e_{i,o},\, c_{j,o,\text{neg}}\big) + b$$

wherein $S_{i,oj,\text{neg}}$ represents the distance between the voice sample to be recognized and the jth negative training sample center, i ∈ (1, M), j ∈ (1, N-1); w and b are learnable parameters.
The formula for calculating the distance value between the voice sample to be recognized and the center of the auxiliary training sample is as follows:
$$S_{i,ok,\text{ass}} = w \cdot \cos\big(e_{i,o},\, c_{k,o,\text{ass}}\big) + b$$

wherein $S_{i,ok,\text{ass}}$ represents the distance between the voice sample to be recognized and the kth auxiliary training sample center, i ∈ (1, M), k ∈ (1, N-2); w and b are learnable parameters.
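Assuming the utterance embeddings are stacked into tensors as noted in the comments, the centers and the three kinds of distance values above can be computed as in the following sketch:

```python
import torch
import torch.nn.functional as F

def centers_and_distances(e_query, positives, auxiliaries, negatives, w, b):
    """Compute the three sample centers and the scaled-cosine distance values.

    Assumed layout: `positives` is (M-1, D) (the target subset with the query
    utterance already excluded), `auxiliaries` is (N-2, M, D) and `negatives`
    is (N-1, M, D); `w` and `b` are the learnable parameters from the formulas.
    """
    c_pos = positives.mean(dim=0)    # c_o^{(-i)}, averaged over M-1 embeddings
    c_ass = auxiliaries.mean(dim=1)  # (N-2, D) auxiliary centers
    c_neg = negatives.mean(dim=1)    # (N-1, D) negative centers

    s_pos = w * F.cosine_similarity(e_query, c_pos, dim=0) + b               # scalar
    s_ass = w * F.cosine_similarity(e_query.unsqueeze(0), c_ass, dim=1) + b  # (N-2,)
    s_neg = w * F.cosine_similarity(e_query.unsqueeze(0), c_neg, dim=1) + b  # (N-1,)
    return s_pos, s_ass, s_neg
```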
Fig. 3 is a schematic diagram illustrating the construction of an extended similarity matrix according to an embodiment of the present invention. As shown in fig. 3, first, the distance value between the voice sample to be recognized and the positive training sample center is calculated, and a vector matrix is constructed based on this distance value;
then, calculating a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
then, combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
then, calculating a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and finally, obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
Namely, the distance value between the voice sample to be recognized and the center of the positive training sample is constructed into a vector matrix, the distance value between the voice sample to be recognized and the center of the negative training sample is constructed into a negative training sample similarity matrix, and then the vector matrix and the negative training sample similarity matrix are combined into a positive and negative similarity matrix.
In the invention, auxiliary training samples are introduced into the selection of training samples, so they are also blended into the similarity matrix when it is calculated. First, the distance values between the voice sample to be recognized and the auxiliary training sample centers are calculated; the distance values for all voice samples to be recognized are then combined into an auxiliary similarity matrix. The positive and negative similarity matrix and the auxiliary similarity matrix are then combined into a new similarity matrix, namely the extended similarity matrix. With the extended similarity matrix, the auxiliary training samples can be integrated into the training of the speaker validation neural network, as in the sketch below.
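A sketch of assembling the extended similarity values for one voice sample to be recognized, reusing the distance values from the previous sketch:

```python
import torch

def extended_similarity_row(s_pos, s_ass, s_neg):
    """Assemble the extended similarity values for one speech sample.

    The positive distance and the negative distances are concatenated into the
    positive-negative similarity part, and the auxiliary distances are appended
    to obtain the extended row; stacking the rows of all query utterances
    yields the extended similarity matrix.
    """
    pos_neg = torch.cat([s_pos.view(1), s_neg])  # positive-negative similarities
    return torch.cat([pos_neg, s_ass])           # extended similarity row
```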
Fig. 4 is a schematic diagram of a mapping relationship between positive training samples, negative training samples, and auxiliary training samples provided in an embodiment of the present invention, and as shown in fig. 4, the number of the positive training samples combined with the auxiliary training samples is equal to the number of the negative training samples by introducing the auxiliary training samples.
It should be noted that the auxiliary training samples are introduced because, in the conventional training of a speaker verification neural network, positive and negative training samples are selected one-to-one or one-to-many. This imbalance makes the training samples unrepresentative and limits the effectiveness of the neural network. The present invention therefore introduces new auxiliary training samples to balance the training samples, so as to improve the accuracy and speed of recognition by the speaker verification neural network.
Optionally, the method further comprises:
and constructing a loss function, and carrying out optimization convergence on the speaker confirmation neural network based on the loss function.
Specifically, the expression of the loss function is:
$$L(e_{i,o}) = 1 - \sigma\Big(\min\big(S_{i,oi,\text{pos}},\ \min_{1 \le k \le N-2} S_{i,ok,\text{ass}}\big)\Big) + \alpha \cdot \max_{1 \le j \le N-1} \sigma\big(S_{i,oj,\text{neg}}\big)$$

wherein $e_{i,o}$ represents the ith speech sample in the speech subset of the target speaker o among the speech samples to be recognized; N represents the number of different speakers; k indexes the kth auxiliary speech subset; j indexes the jth speech subset in the negative training sample; $\sigma$ denotes the sigmoid function; $S_{i,ok,\text{ass}}$ represents the distance value between the speech sample to be recognized and the auxiliary training sample center of the kth auxiliary speech subset; $S_{i,oi,\text{pos}}$ represents the distance value between the speech sample to be recognized and the positive training sample center; $S_{i,oj,\text{neg}}$ represents the distance value between the speech sample to be recognized and the negative training sample center of the jth speech subset; and $\alpha$ is a regulatory factor.
Because auxiliary training samples are introduced into the training samples, a new loss function needs to be designed. Through the loss function, the minimum distance value between the voice sample to be recognized and the auxiliary training sample centers or the positive training sample center, i.e. the center point of the target speaker farthest from the voice sample to be recognized, is selected; likewise, the maximum distance value to the negative training sample centers, i.e. the center point of the contrasted speaker closest to the voice sample to be recognized, participates in the calculation of the loss function. By selecting the optimal sample-center distances, the loss function enables the network to converge quickly and recognize speakers more accurately.
Specifically, the overall ATS-GE2E loss value is $L(e_{oi})$, with the specific mathematical expression:

$$L = \sum_{o=1}^{N}\sum_{i=1}^{M} L(e_{i,o})$$

wherein o ∈ (1, N) and i ∈ (1, M).
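A sketch of this loss for one speech sample, reconstructed from the description above (the exact patented form may differ in detail):

```python
import torch

def ats_ge2e_loss(s_pos, s_ass, s_neg, alpha=1.0):
    """Loss for one speech sample to be recognized.

    The smallest similarity among the positive and auxiliary centers (the
    farthest own-speaker center) and the largest negative similarity (the
    closest contrasted-speaker center) drive the gradient; `alpha` is the
    regulatory factor. Summing over all o and i gives the overall loss.
    """
    hardest_pos = torch.min(torch.cat([s_pos.view(1), s_ass]))
    hardest_neg = torch.max(s_neg)
    return 1.0 - torch.sigmoid(hardest_pos) + alpha * torch.sigmoid(hardest_neg)
```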
Meanwhile, as shown in fig. 5, an embodiment of the present invention further provides a speaker identity verification apparatus, including:
the acquisition module is used for acquiring the trained speaker confirmation neural network;
and the recognition module is used for inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized.
Meanwhile, an embodiment of the present invention further provides a computer-readable storage medium, which includes instructions, and when the instructions are executed on a computer, the instructions cause the computer to execute the speaker identity verification method according to any one of the above embodiments.
Meanwhile, the embodiment of the invention also provides a speaker identity confirmation device, which comprises a memory, a processor and a computer program which is stored on the memory and can be run on the processor, wherein when the processor executes the program, the speaker identity confirmation method in any one of the above embodiments is realized.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A method for speaker identity verification, comprising the steps of:
acquiring a trained speaker confirmation neural network;
inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized; wherein the speaker voice database comprises a plurality of different voices of a plurality of different speakers;
the method also comprises the step of training the speaker to confirm the neural network in advance, and specifically comprises the following steps:
constructing a speaker confirmation neural network for extracting speaker characteristic representation;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix according to the training sample;
training the speaker confirmation neural network based on the training sample and the extended similarity matrix to obtain the trained speaker confirmation neural network;
the method includes the steps of selecting training samples based on different speech subsets of different speakers, wherein the training samples include a speech sample to be recognized, a positive training sample, a negative training sample for comparison, and an auxiliary training sample for supplementing the number of the positive training samples, and specifically includes:
selecting N different speakers, wherein the different speakers comprise a target speaker and N-1 comparison speakers, each speaker in the different speakers selects N-1 voice subsets, and each voice subset comprises M sentences of voices;
selecting a voice subset from the voice subsets of the target speaker as a target voice subset, and selecting a sentence of voice from the target voice subset as the voice sample to be recognized; and using other voices in the target voice subset as the positive training samples;
using the other voice subsets except the target voice subset in the voice subsets of the target speaker as the auxiliary training samples;
selecting a voice subset from the voice subsets of the contrasted speakers as the negative training sample;
wherein, the determining an extended similarity matrix according to the training samples specifically includes:
obtaining a positive training sample center, a negative training sample center and an auxiliary training sample center according to the positive training sample, the negative training sample and the auxiliary training sample;
obtaining a distance value between the voice sample to be recognized and the positive training sample center according to the voice sample to be recognized and the positive training sample center, and constructing a vector matrix based on the distance value between the voice sample to be recognized and the positive training sample center;
acquiring a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
acquiring a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
2. The method for confirming the identity of a speaker according to claim 1, wherein the constructing of the speaker confirmation neural network specifically comprises:
obtaining a voice sample;
extracting acoustic features of the voice sample;
and inputting the acoustic features into an LSTM network to learn the speaker feature representation of the voice sample, and obtaining the speaker confirmation neural network.
3. The speaker identity verification method according to claim 1 or 2, wherein the method further comprises:
and constructing a loss function, and carrying out optimization convergence on the speaker confirmation neural network based on the loss function.
4. The speaker identity verification method of claim 3, wherein the loss function is expressed as:
$$L(e_{i,o}) = 1 - \sigma\Big(\min\big(S_{i,oi,\text{pos}},\ \min_{1 \le k \le N-2} S_{i,ok,\text{ass}}\big)\Big) + \alpha \cdot \max_{1 \le j \le N-1} \sigma\big(S_{i,oj,\text{neg}}\big)$$

wherein $e_{i,o}$ represents the ith speech sample in the speech subset of the target speaker o among the speech samples to be recognized; N represents the number of different speakers; k indexes the kth auxiliary speech subset; j indexes the jth speech subset in the negative training sample; $\sigma$ denotes the sigmoid function; $S_{i,ok,\text{ass}}$ represents the distance value between the speech sample to be recognized and the auxiliary training sample center of the kth auxiliary speech subset; $S_{i,oi,\text{pos}}$ represents the distance value between the speech sample to be recognized and the positive training sample center; $S_{i,oj,\text{neg}}$ represents the distance value between the speech sample to be recognized and the negative training sample center of the jth speech subset; and $\alpha$ is a regulatory factor.
5. A speaker identity verification apparatus, comprising:
the acquisition module is used for acquiring the trained speaker confirmation neural network;
the recognition module is used for inputting the voice of the speaker to be recognized and the speaker voice database into the trained speaker confirmation neural network, and recognizing the identity of the speaker corresponding to the voice of the speaker to be recognized;
the acquisition module is further used for training the speaker to confirm the neural network in advance, and specifically comprises:
constructing a speaker confirmation neural network for extracting speaker characteristic representation;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix according to the training sample;
training the speaker confirmation neural network based on the training sample and the extended similarity matrix to obtain the trained speaker confirmation neural network;
the method includes the steps of selecting training samples based on different speech subsets of different speakers, wherein the training samples include a speech sample to be recognized, a positive training sample, a negative training sample for comparison, and an auxiliary training sample for supplementing the number of the positive training samples, and specifically includes:
selecting N different speakers, wherein the different speakers comprise a target speaker and N-1 comparison speakers, each speaker in the different speakers selects N-1 voice subsets, and each voice subset comprises M sentences of voices;
selecting a voice subset from the voice subsets of the target speaker as a target voice subset, and selecting a sentence of voice from the target voice subset as the voice sample to be recognized; and using other voices in the target voice subset as the positive training samples;
using the other voice subsets except the target voice subset in the voice subsets of the target speaker as the auxiliary training samples;
selecting a voice subset from the voice subsets of the contrasted speakers as the negative training sample;
wherein, the determining an extended similarity matrix according to the training samples specifically includes:
obtaining a positive training sample center, a negative training sample center and an auxiliary training sample center according to the positive training sample, the negative training sample and the auxiliary training sample;
obtaining a distance value between the voice sample to be recognized and the positive training sample center according to the voice sample to be recognized and the positive training sample center, and constructing a vector matrix based on the distance value between the voice sample to be recognized and the positive training sample center;
acquiring a distance value between the voice sample to be recognized and the center of the negative training sample, and constructing a similarity matrix of the negative training sample based on the distance value between the voice sample to be recognized and the center of the negative training sample;
combining the vector matrix and the negative training sample similarity matrix into a positive and negative similarity matrix;
acquiring a distance value between the voice sample to be recognized and the center of the auxiliary training sample, and establishing an auxiliary similarity matrix based on the distance value between the voice sample to be recognized and the center of the auxiliary training sample;
and obtaining the extended similarity matrix according to the positive and negative similarity matrix and the auxiliary similarity matrix.
6. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform a method for speaker identity verification according to any of claims 1 to 4.
7. A speaker identity verification device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the speaker identity verification method according to any one of claims 1 to 4.
CN201910407670.8A 2019-05-15 2019-05-15 Speaker identity confirmation method, device and storage medium Active CN110223699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910407670.8A CN110223699B (en) 2019-05-15 2019-05-15 Speaker identity confirmation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910407670.8A CN110223699B (en) 2019-05-15 2019-05-15 Speaker identity confirmation method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110223699A CN110223699A (en) 2019-09-10
CN110223699B (en) 2021-04-13

Family

ID=67821242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910407670.8A Active CN110223699B (en) 2019-05-15 2019-05-15 Speaker identity confirmation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110223699B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712792A (en) * 2019-10-25 2021-04-27 Tcl集团股份有限公司 Dialect recognition model training method, readable storage medium and terminal device
CN111712874B (en) 2019-10-31 2023-07-14 支付宝(杭州)信息技术有限公司 Method, system, device and storage medium for determining sound characteristics
CN111429918A (en) * 2020-03-26 2020-07-17 云知声智能科技股份有限公司 Phone call fraud visiting method and system based on voiceprint recognition and intention analysis

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683680A (en) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 Speaker recognition method and device and computer equipment and computer readable media
CN107180628A (en) * 2017-05-19 2017-09-19 百度在线网络技术(北京)有限公司 Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model
US20180039888A1 (en) * 2016-08-08 2018-02-08 Interactive Intelligence Group, Inc. System and method for speaker change detection
US20180082691A1 (en) * 2016-09-19 2018-03-22 Pindrop Security, Inc. Dimensionality reduction of baum-welch statistics for speaker recognition
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
CN109166586A (en) * 2018-08-02 2019-01-08 平安科技(深圳)有限公司 A kind of method and terminal identifying speaker
CN109243467A (en) * 2018-11-14 2019-01-18 龙马智声(珠海)科技有限公司 Sound-groove model construction method, method for recognizing sound-groove and system
CN109256135A (en) * 2018-08-28 2019-01-22 桂林电子科技大学 A kind of end-to-end method for identifying speaker, device and storage medium
US20190035431A1 (en) * 2017-07-28 2019-01-31 Adobe Systems Incorporated Apparatus, systems, and methods for integrating digital media content
CN109326294A (en) * 2018-09-28 2019-02-12 杭州电子科技大学 A kind of relevant vocal print key generation method of text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"基于三元组损失与流形降维的文本无关说话人识别方法研究";刘崇鸣;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20200215;全文 *
"Centroid-based deep metric learning for speaker recognition";Jixuan Wang 等;《https://arxiv.org/abs/1902.02375》;20190206;全文 *
B. Li 等."Threshold Re-weighting Attention Mechanism for Speaker Verification".《2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC)》.2019, *

Also Published As

Publication number Publication date
CN110223699A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN108417217B (en) Speaker recognition network model training method, speaker recognition method and system
CN110223699B (en) Speaker identity confirmation method, device and storage medium
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN106098068A (en) A kind of method for recognizing sound-groove and device
Senoussaoui et al. First attempt of boltzmann machines for speaker verification
Li et al. Towards Discriminative Representation Learning for Speech Emotion Recognition.
Liu et al. An investigation on back-end for speaker recognition in multi-session enrollment
CN108986798B (en) Processing method, device and the equipment of voice data
CN113140254B (en) Meta-learning drug-target interaction prediction system and prediction method
CN110992988B (en) Speech emotion recognition method and device based on domain confrontation
CN111723870B (en) Artificial intelligence-based data set acquisition method, apparatus, device and medium
CN111564179B (en) Species biology classification method and system based on triple neural network
CN111401105B (en) Video expression recognition method, device and equipment
CN108877812B (en) Voiceprint recognition method and device and storage medium
CN106601258A (en) Speaker identification method capable of information channel compensation based on improved LSDA algorithm
CN113361636A (en) Image classification method, system, medium and electronic device
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN108830201A (en) Acquisition methods, device, computer equipment and the storage medium of sample triple
Ma et al. LID-senone extraction via deep neural networks for end-to-end language identification
CN110717027A (en) Multi-round intelligent question-answering method, system, controller and medium
CN114398611A (en) Bimodal identity authentication method, device and storage medium
CN112052686B (en) Voice learning resource pushing method for user interactive education
CN111680514B (en) Information processing and model training method, device, equipment and storage medium
Zhang et al. Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification}}
CN115222047A (en) Model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant