CN110111798B - Method, terminal and computer readable storage medium for identifying speaker - Google Patents


Info

Publication number
CN110111798B
Authority
CN
China
Prior art keywords
speaker
audio information
digital
latent variable
string
Prior art date
Legal status
Active
Application number
CN201910354414.7A
Other languages
Chinese (zh)
Other versions
CN110111798A (en)
Inventor
张丝潆
曾庆亮
王健宗
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910354414.7A priority Critical patent/CN110111798B/en
Publication of CN110111798A publication Critical patent/CN110111798A/en
Priority to PCT/CN2019/103299 priority patent/WO2020220541A1/en
Application granted granted Critical
Publication of CN110111798B publication Critical patent/CN110111798B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/12 - Score normalisation
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Abstract

The invention is applicable to the technical field of computers, and provides a method and a terminal for identifying a speaker, wherein the method comprises the following steps: acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string, the audio information including the digital string; extracting a loudspeaker latent variable and digital latent variables of the audio information, the loudspeaker latent variable being used for identifying characteristic information of the loudspeaker, and the digital latent variables being used for identifying the pronunciation characteristics of the person to be tested for the numbers in the audio information; and, when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result. In the embodiment of the invention, the loudspeaker latent variable and the digital latent variables in the audio information are used together to identify the speaker, so interference with the identity recognition result caused by different speakers pronouncing the same number differently, or by one speaker pronouncing the same number differently at different moments, can be avoided, and the accuracy of the identity recognition result can be improved.

Description

Method, terminal and computer readable storage medium for identifying speaker
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, a terminal, and a computer readable storage medium for identifying a speaker.
Background
With the rapid development of information technology and network technology, the demand for identity recognition technology keeps growing. Conventional identification technologies based on password authentication have revealed many drawbacks in practical applications (such as low security and reliability), while identification technologies based on biometric authentication have matured steadily in recent years and shown their superiority in practice. Voiceprint recognition is one such biometric identification technology.
A voiceprint is an information map of a speaker's sound spectrum. Because each person's vocal organs differ, the sounds they produce and their timbres differ as well, so identification that takes the voiceprint as its basic feature offers irreplaceability and stability. Voiceprint recognition comes in two types: text-dependent and text-independent.
Compared with text-independent speaker recognition, in which the speech content is unconstrained, text-dependent speaker verification systems are better suited to security applications because they tend to achieve higher accuracy on short utterances.
In typical text-dependent speaker recognition, each user uses a fixed phrase so that the enrollment and test utterances match. In that case, an utterance from the user can be recorded in advance and then played back. When the training and test utterances instead share the same voice content across different scenes, recognition security is improved to a certain extent: the system randomly issues a number of digital strings, and the user passes voiceprint recognition only by correctly reciting the corresponding content; the randomness thus introduced ensures that the voiceprints acquired each time in text-dependent recognition differ in content and timing. However, when a speaker is identified using a randomly prompted digit string, the vocabulary of the sample digits is fixed and the number samples are limited, while different speakers pronounce the same digit differently, which may make it impossible to accurately identify the speaker.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method and a terminal for identifying a speaker, so as to solve the problem in the prior art that, for complex voiceprint voice information (for example, short voice, imitated voice, etc.), a text-independent voiceprint recognition system cannot accurately extract the voice characteristics of the speaker and therefore cannot accurately identify the speaker's identity.
A first aspect of an embodiment of the present invention provides a method for identifying a speaker, including:
acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string; the reference digital string is pre-stored and randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the person to be tested;
extracting a loudspeaker latent variable and a digital latent variable of the audio information; the loudspeaker latent variable is used for identifying feature information of the loudspeaker, the digital latent variable is extracted when it is confirmed that the number string spoken by the person to be tested is identical to the reference number string, and the digital latent variable is used for identifying the pronunciation features of the person to be tested for the numbers in the audio information;
when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, the digital latent variables of each number uttered by a single speaker in a sound sample set; each digital latent variable has an identity tag identifying the speaker to which it belongs; and the Bayesian model has a correspondence to the single speaker in the sound sample set.
A second aspect of an embodiment of the present invention provides a terminal, including:
the acquisition unit is used for acquiring the audio information to be identified spoken by the person to be tested for the reference digital string; the reference digital string is pre-stored and randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the person to be tested;
an extracting unit for extracting a loudspeaker latent variable and a digital latent variable of the audio information; the loudspeaker latent variable is used for identifying feature information of the loudspeaker, the digital latent variable is extracted when it is confirmed that the number string spoken by the person to be tested is identical to the reference number string, and the digital latent variable is used for identifying the pronunciation features of the person to be tested for the numbers in the audio information;
the identification unit is used for inputting the digital latent variable into a preset Bayesian model for voiceprint identification when the loudspeaker latent variable meets the preset requirement, so as to obtain an identity recognition result; the Bayesian model is obtained by training, with a machine learning algorithm, the digital latent variables of each number uttered by a single speaker in a sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it corresponds; and the Bayesian model has a correspondence to the single speaker in the sound sample set.
A third aspect of an embodiment of the present invention provides a terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string; the reference digital string is pre-stored and randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the person to be tested;
extracting a loudspeaker latent variable and a digital latent variable of the audio information; the loudspeaker latent variable is used for identifying feature information of the loudspeaker, the digital latent variable is extracted when it is confirmed that the number string spoken by the person to be tested is identical to the reference number string, and the digital latent variable is used for identifying the pronunciation features of the person to be tested for the numbers in the audio information;
when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, the digital latent variables of each number uttered by a single speaker in a sound sample set; each digital latent variable has an identity tag identifying the speaker to which it belongs; and the Bayesian model has a correspondence to the single speaker in the sound sample set.
A fourth aspect of embodiments of the present invention provides a computer readable storage medium storing a computer program which when executed by a processor performs the steps of:
acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string; the reference digital string is pre-stored and randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the person to be tested;
extracting a loudspeaker latent variable and a digital latent variable of the audio information; the loudspeaker latent variable is used for identifying feature information of the loudspeaker, the digital latent variable is extracted when it is confirmed that the number string spoken by the person to be tested is identical to the reference number string, and the digital latent variable is used for identifying the pronunciation features of the person to be tested for the numbers in the audio information;
when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, the digital latent variables of each number uttered by a single speaker in a sound sample set; each digital latent variable has an identity tag identifying the speaker to which it belongs; and the Bayesian model has a correspondence to the single speaker in the sound sample set.
The method and the terminal for identifying the speaker provided by the embodiment of the invention have the following beneficial effects:
according to the embodiment of the invention, the speaker latent variable and the digital latent variable of the audio information to be identified are extracted; when the latent variable of the loudspeaker meets the requirement, inputting the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result. Because the preset requirement is set based on the value of the speaker latent variable corresponding to the clearly identifiable audio information, when the speaker latent variable in the audio information meets the preset requirement, the interference of the performance of the speaker on the identity recognition result can be eliminated, at the moment, the identity information of the speaker is recognized based on the digital latent variable of each number uttered by the person to be tested, and because the number of the digital latent variables of each number can be multiple, the identity of the speaker can be accurately recognized even if the speaker utters the same number at different moments, the situation that different speakers utter the same number at different moments can be avoided, and the condition that the identity recognition result is interfered can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an implementation of a method for speaker identification according to an embodiment of the present invention;
FIG. 2 is a flow chart of an implementation of a method for speaker identification according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of null hypothesis and alternative hypotheses provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a terminal according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a terminal according to another embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for identifying a speaker according to an embodiment of the present invention. The execution subject of the method for identifying a speaker in this embodiment is a terminal. The terminal includes, but is not limited to, a mobile terminal such as a smart phone, a tablet computer, or a wearable device, and may also be a desktop computer or the like. As shown in fig. 1, the method for identifying a speaker may include:
s101: acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string; the reference digital string is pre-stored, randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the to-be-detected person.
When the terminal detects a speaker identification instruction, the terminal may acquire, through a built-in sound pickup device (such as a microphone), the audio information to be identified uttered by a speaker in the surrounding environment; in this case the audio information is uttered by the speaker according to a reference digital string randomly given by the terminal. Alternatively, the terminal acquires the audio file or video file corresponding to a file identifier contained in the speaker identification instruction, extracts the audio information from that file, and takes it as the audio information to be identified. The audio file or video file contains the audio information obtained when the person to be tested recites the reference digital string, and it may be uploaded by a user or downloaded from a server that stores such files. The reference number string is pre-stored in the terminal and randomly played or randomly displayed by the terminal. There may be a plurality of reference number strings.
The audio information to be identified includes audio corresponding to a number string composed of at least one number.
S102: extracting a loudspeaker latent variable and digital latent variables of the audio information; the loudspeaker latent variable is used for identifying characteristic information of the loudspeaker; a digital latent variable is extracted when it is confirmed that the number string uttered by the person to be tested is identical to the reference number string, and is used for identifying the pronunciation characteristics of the person to be tested for the numbers in the audio information.
The terminal calculates the loudspeaker latent variable of the audio information based on the acquired audio information. Loudspeaker latent variables include, but are not limited to, the signal-to-noise ratio, and may also include the efficiency, sound pressure level, etc. of the loudspeaker.
The signal-to-noise ratio (SNR) is a parameter describing the proportional relationship between the useful component and the noise component of a signal. The higher the signal-to-noise ratio of the loudspeaker, the clearer the sound it picks up. For example, the terminal extracts from the audio information a normal sound signal and the noise signal present where the audio carries no signal, and calculates the signal-to-noise ratio of the audio information from the two.
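As a rough illustration only, the following sketch shows one way such a signal-to-noise ratio could be computed from a voiced segment and a silence segment; the function name, the prior segmentation of the audio, and the toy signals are assumptions for illustration, not part of the patent:

```python
import numpy as np

def estimate_snr_db(voiced: np.ndarray, silence: np.ndarray) -> float:
    """Estimate the SNR (in dB) from a normal sound segment and a
    noise-only segment taken where the audio carries no signal."""
    signal_power = float(np.mean(voiced.astype(np.float64) ** 2))
    noise_power = float(np.mean(silence.astype(np.float64) ** 2))
    if noise_power == 0.0:
        return float("inf")  # no measurable noise at all
    return 10.0 * np.log10(signal_power / noise_power)

# Hypothetical usage: `voiced` and `silence` are 1-D sample arrays obtained
# by segmenting the captured audio beforehand.
rng = np.random.default_rng(0)
silence = 0.01 * rng.normal(size=16000)
voiced = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000) + silence
print(f"SNR: {estimate_snr_db(voiced, silence):.1f} dB")
```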
The terminal may input the acquired audio information into a pre-trained deep neural network (Deep Neural Networks, DNN) model and extract the digital latent variable of each number in the audio information through the deep neural network. A digital latent variable is used to identify one pronunciation characteristic of a given number. The same number may have at least two different digital latent variables, i.e., at least two different pronunciations. For example, in Chinese the number "1" may be pronounced "yī" (一) or "yāo" (幺).
In this embodiment, the deep neural network model includes an input layer, a hidden layer, and an output layer. The input layer includes an input layer node for receiving input audio information from the outside. The hidden layer comprises two or more hidden layer nodes and is used for processing the audio information and extracting the digital latent variables of the audio information. The output layer is used for outputting the processing result, namely the digital latent variables of the audio information.
The deep neural network model is trained based on a sound sample set, which comprises sound samples corresponding to each number spoken by a speaker. There may be multiple speakers, such as 500 or 1500; to a certain extent, the more samples used in training, the more accurate the results when identifying with the trained neural network model. The sound sample set comprises a preset number of sound samples; each sound sample carries a labeled digital latent variable, and each digital latent variable corresponds one-to-one with a sample label. A sound sample may contain only one number, and each number in the sound sample set may correspond to at least two digital latent variables. The numbers covered by the sound samples include all numbers that may appear in a randomly generated reference number string. For example, when the randomly generated reference number string comprises any 6 of the 10 digits 0 to 9, the sound sample set contains sound samples for all 10 digits 0 to 9.
In the training process, the input data of the deep neural network model is the audio information, which may be represented as a vector matrix derived from the audio information; the vector matrix is composed of the vectors corresponding to the number-bearing audio segments extracted in sequence from the audio information. The output data of the deep neural network model is the digital latent variable of each number in the audio information.
It can be appreciated that one speaker corresponds to one neural network recognition model, and when a plurality of speakers need to be recognized, the neural network recognition model corresponding to each of the plurality of speakers is trained.
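The patent does not give the network's concrete layout beyond input, hidden, and output layers, so the following is only a minimal sketch of such a digit-latent extractor; the dimensions, pooling over frames, and class name are assumptions:

```python
import torch
import torch.nn as nn

class DigitLatentExtractor(nn.Module):
    """Illustrative DNN: input layer -> hidden layers -> digit latent variable."""
    def __init__(self, feat_dim: int = 40, hidden_dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),    # input layer node
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),  # hidden layer nodes
            nn.Linear(hidden_dim, latent_dim),             # output: digit latent variable
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, feat_dim) acoustic vectors for one spoken digit;
        # average the frame-level outputs into one latent vector for the digit.
        return self.net(frames).mean(dim=0)

# Hypothetical usage: one latent vector per digit segment of the utterance.
model = DigitLatentExtractor()
latent = model(torch.randn(50, 40))  # 50 frames of 40-dim features
print(latent.shape)                  # torch.Size([64])
```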
When the terminal extracts the loudspeaker latent variable, it judges, based on a preset condition, whether the extracted loudspeaker latent variable meets the preset requirement. The preset condition is used to judge whether the sound in the audio information is clear enough to be recognized, and it can be set based on the value of the loudspeaker latent variable corresponding to clearly recognizable sound. For example, when the loudspeaker latent variable is the signal-to-noise ratio, the preset condition may be that the signal-to-noise ratio is greater than or equal to a preset signal-to-noise ratio threshold, where the preset threshold is the signal-to-noise ratio corresponding to clearly recognizable sound.
When the loudspeaker latent variable meets the requirement, S103 is executed; when it does not, the method may return to S101, or may output an identification result indicating that the current speaker does not match the speaker corresponding to the reference digital string.
S103: when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, the digital latent variables of each number uttered by a single speaker in a sound sample set; each digital latent variable has an identity tag identifying the speaker to which it belongs; and the Bayesian model has a correspondence to the single speaker in the sound sample set.
For example, if a digital latent variable is produced when a certain speaker speaks a number, the identity tag of that digital latent variable identifies the speaker who spoke the number.
In this embodiment, the expression of the preset Bayesian model is as follows:

P(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), with x_ijk = μ + u_i + v_j + ε_ijk,

where P(x_ijk | u_i, v_j, θ) represents the probability that a person speaks a number, and x_ijk denotes the vector of the j-th number spoken by the i-th person in the k-th session. Since the speaker may be required to speak a plurality of different number strings when performing authentication or entering information, k indexes the session, and θ is a generic term for the parameters of the Bayesian model. The conditional probability P(x_ijk | u_i, v_j, θ) follows a Gaussian distribution with mean μ + u_i + v_j and variance Σ_ε, where Σ_ε denotes the diagonal covariance of ε. The signal component s_ij = μ + u_i + v_j, i.e., the signal component depends only on the speaker and the number; the noise component ε_ijk represents the deviation, or noise, of the j-th number spoken by the i-th person in the k-th session; and μ represents the overall mean of the training vectors.
u_i denotes the latent variable of speaker i; its prior P(u_i) = N(u_i | 0, Σ_u) is a Gaussian over u_i with diagonal covariance Σ_u. v_j denotes the digital latent variable of number j; its prior P(v_j) = N(v_j | 0, Σ_v) is a Gaussian over v_j with diagonal covariance Σ_v.
Figure SMS_15
Digital latent variable representing speaker i, +.>
Figure SMS_18
Is defined as having diagonal covariance +>
Figure SMS_21
Is (1) Gauss->
Figure SMS_14
Representing the probability that the speaker is i, +.>
Figure SMS_19
Is about->
Figure SMS_22
Is a gaussian distribution of (c); />
Figure SMS_24
The digital latent variable representing speaker j is +.>
Figure SMS_16
Is defined as having diagonal covariance +>
Figure SMS_17
Is (1) Gauss->
Figure SMS_20
Probability of representing speaker j, +.>
Figure SMS_23
Is about
Figure SMS_13
Is a gaussian distribution of (c).
Formally, N(x | μ, Σ) denotes a Gaussian distribution over x with mean μ and covariance Σ.
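Under the reconstruction of the model given above, the conditional probability might be evaluated as follows; this is a sketch with toy dimensions and made-up values, not the patent's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def digit_log_likelihood(x, mu, u_i, v_j, sigma_eps):
    """Log-likelihood of observation x under the additive Gaussian model
    x = mu + u_i + v_j + eps, with eps ~ N(0, diag(sigma_eps))."""
    return multivariate_normal.logpdf(x, mean=mu + u_i + v_j, cov=np.diag(sigma_eps))

# Hypothetical values purely for illustration.
dim = 4
rng = np.random.default_rng(0)
mu = rng.normal(size=dim)         # overall mean of the training vectors
u_i = rng.normal(size=dim)        # latent variable of speaker i
v_j = rng.normal(size=dim)        # digital latent variable of number j
sigma_eps = np.full(dim, 0.1)     # diagonal noise covariance
x = mu + u_i + v_j + rng.normal(scale=0.3, size=dim)
print(digit_log_likelihood(x, mu, u_i, v_j, sigma_eps))
```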
When the terminal has completed the iterative calculation of the probability corresponding to each number contained in the audio information, it identifies the speaker based on the probabilities of each number obtained in each iteration, yielding an identity recognition result.
For example, with a total of 10 iterations: when the probability that speaker i spoke a number is greater than or equal to a preset probability threshold (for example, 0.8), 1 point is recorded; when it is less than the preset probability threshold, 0 points are recorded. The total score of speaker i over the 10 iterations is then counted, and when the total score is greater than or equal to a preset score threshold (for example, 7 points), the speaker corresponding to the audio information is determined to be speaker i.
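A minimal sketch of this per-iteration voting scheme; the function name and the probability list are illustrative assumptions:

```python
def identify_by_votes(probs, prob_threshold=0.8, score_threshold=7):
    """Score one candidate speaker: 1 point per iteration whose probability
    clears prob_threshold; accept if the total reaches score_threshold."""
    total = sum(1 for p in probs if p >= prob_threshold)
    return total >= score_threshold

# Hypothetical per-iteration probabilities for speaker i over 10 iterations.
probs_i = [0.9, 0.85, 0.7, 0.95, 0.88, 0.91, 0.82, 0.79, 0.93, 0.87]
print(identify_by_votes(probs_i))  # True: 8 of 10 iterations scored a point
```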
According to the embodiment of the invention, the loudspeaker latent variable and the digital latent variable of the audio information to be identified are extracted; when the loudspeaker latent variable meets the requirement, the digital latent variable is input into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result. Because the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information, when the loudspeaker latent variable of the audio information meets the preset requirement, interference with the identity recognition result caused by deviations in the loudspeaker's reproduction of digit pronunciations can be excluded, and the speaker's identity can be recognized based on the digital latent variable of each number uttered by the person to be tested.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for identifying a speaker according to another embodiment of the present invention. The execution subject of the method in this embodiment is a terminal. The terminal includes, but is not limited to, a mobile terminal such as a smart phone, a tablet computer, or a wearable device, and may also be a desktop computer or the like. The method for identifying a speaker of this embodiment includes the following steps:
S201: acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string; the reference digital string is pre-stored, randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the to-be-detected person.
In this embodiment, S201 is the same as S101 in the previous embodiment, and please refer to the description related to S101 in the previous embodiment, which is not repeated here.
S202: extracting a loudspeaker latent variable and digital latent variables of the audio information; the loudspeaker latent variable is used for identifying characteristic information of the loudspeaker; a digital latent variable is extracted when it is confirmed that the number string uttered by the person to be tested is identical to the reference number string, and is used for identifying the pronunciation characteristics of the person to be tested for the numbers in the audio information.
In this embodiment, S202 is the same as S102 in the previous embodiment, and please refer to the description related to S102 in the previous embodiment, which is not repeated here.
Further, S202 may include S2021-S2023. The method comprises the following steps:
S2021: extracting a loudspeaker latent variable from the audio information, and detecting whether the audio information is qualified based on the value of the loudspeaker latent variable.
Specifically, the terminal may extract a loudspeaker latent variable from the audio information to be identified, and detect whether the audio information is qualified based on the extracted value and a preset loudspeaker latent variable threshold, so as to confirm whether the sound in the audio information is clear enough to be recognized. The loudspeaker latent variable is used to identify characteristic information of the loudspeaker. Loudspeaker latent variables include, but are not limited to, the signal-to-noise ratio, and may also include the efficiency, sound pressure level, etc. of the loudspeaker. The preset loudspeaker latent variable threshold is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information.
The signal-to-noise ratio is a parameter describing the proportional relationship between the useful component and the noise component of a signal. The higher the signal-to-noise ratio of the loudspeaker, the cleaner the sound it picks up. For example, the terminal extracts from the audio information a normal sound signal and the noise signal present where the audio carries no signal, and calculates the signal-to-noise ratio of the audio information from the two.
For example, when the loudspeaker latent variable is the signal-to-noise ratio, the terminal judges the audio information to be identified as qualified when it detects that the loudspeaker's signal-to-noise ratio is greater than or equal to 70, in which case the numbers in the audio information can be clearly recognized.
When the detection result is that the audio information is unqualified, the speaker is prompted to re-read the random reference digital string so that the audio information can be reacquired; alternatively, the audio information to be identified is reacquired from the database in which the audio data is stored. The reference number string is randomly generated, or randomly acquired from a database, during the terminal's speaker identification process and is prompted to the user; the terminal may randomly play or display the reference number string before S201, and when the reference number string is played by voice broadcast, it is played with standard pronunciation. The reference number string includes a preset quantity of digits; for example, it contains 5 or 6 digits.
When the detection result is that the audio information is qualified, S2022 is executed.
S2022: when the audio information is qualified, detecting whether the number string spoken by the person to be tested is identical to the reference number string, based on the reference number string and the audio corresponding to the number string spoken by the person to be tested.
When the detection result is that the audio information is qualified, the terminal may sequentially extract the number-bearing voice segments from the audio information, identify the number string contained in the audio information using speech recognition technology, compare the reference number string with the identified number string, and judge whether they are identical.
Alternatively, the terminal may play the reference digital string to obtain its corresponding audio, compare that audio with the audio corresponding to the digital string spoken by the person to be tested, and thereby detect whether the two digital strings are identical. Comparing the audio obtained by playing the reference digital string with the captured audio of the digital string spoken by the person to be tested can reduce digit pronunciation deviations caused by loudspeaker performance when the terminal picks up and plays the audio information.
When any number in the identified number string differs from the corresponding number in the reference number string, or the numbers are arranged in a different order, it is determined that the number string contained in the audio information differs from the reference number string. In this case, the method may return to S201 to reacquire the voice data to be recognized, or an identification result indicating that the current speaker does not match the speaker corresponding to the reference number string may be output.
When every number in the reference number string is identical to the corresponding number in the recognized number string and the numbers are arranged in the same order, it is determined that the number string contained in the audio information is identical to the reference number string, and S2023 is performed (a sketch of this comparison follows).
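A minimal sketch of the comparison in S2022; `recognize_digits` stands in for whatever speech recognizer the terminal uses and is an assumed helper, not an API from the patent:

```python
def digit_strings_match(reference: str, recognized: str) -> bool:
    """The match fails if any digit differs or the digits are ordered differently."""
    if len(reference) != len(recognized):
        return False
    return all(ref == rec for ref, rec in zip(reference, recognized))

# Hypothetical usage:
# recognized = recognize_digits(audio_info)   # e.g. returns "582019"
# if digit_strings_match("582019", recognized):
#     pass  # proceed to S2023 and extract the digital latent variables
print(digit_strings_match("582019", "582019"))  # True
print(digit_strings_match("582019", "528019"))  # False: order differs
```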
S2023: when the detection result is that the two number strings are identical, extracting the digital latent variables from the audio information.
When the terminal confirms that the number string contained in the audio information is identical to the reference number string, the audio information is converted into a vector matrix, the vector matrix is input into the DNN model for processing, and the digital latent variable of each number in the number string is extracted from the audio information. The digital latent variable is used to identify the pronunciation characteristics of the person to be tested for the numbers in the audio information.
Specifically, the terminal may input the obtained audio information into the pre-trained DNN model and extract the digital latent variable of each number in the audio information through the deep neural network. A digital latent variable is used to identify one pronunciation characteristic of a given number. The same number may have at least two different digital latent variables, i.e., at least two different pronunciations. For example, in Chinese the number "1" may be pronounced "yī" (一) or "yāo" (幺).
In this embodiment, the deep neural network model includes an input layer, a hidden layer, and an output layer. The input layer includes an input layer node for receiving input audio information from the outside. The hidden layer comprises two or more hidden layer nodes and is used for processing the audio information and extracting the digital latent variables of the audio information. The output layer is used for outputting the processing result, namely the digital latent variables of the audio information.
The deep neural network model is trained based on a sound sample set, which comprises sound samples corresponding to each number spoken by a speaker. There may be multiple speakers, such as 500 or 1500; to a certain extent, the more samples used in training, the more accurate the results when identifying with the trained neural network model. The sound sample set comprises a preset number of sound samples; each sound sample carries a labeled digital latent variable, and each digital latent variable corresponds one-to-one with a sample label. A sound sample may contain only one number, and each number in the sound sample set may correspond to at least two digital latent variables. The numbers covered by the sound samples include all numbers that may appear in a randomly generated reference number string. For example, when the randomly generated reference number string comprises any 6 of the 10 digits 0 to 9, the sound sample set contains sound samples for all 10 digits 0 to 9.
In the training process, the input data of the deep neural network model is the audio information, which may be represented as a vector matrix derived from the audio information; the vector matrix is composed of the vectors corresponding to the number-bearing audio segments extracted in sequence from the audio information. The output data of the deep neural network model is the digital latent variable of each number in the audio information.
It can be appreciated that one speaker corresponds to one neural network recognition model, and when a plurality of speakers need to be recognized, the neural network recognition model corresponding to each of the plurality of speakers is trained.
S203: when the loudspeaker latent variable meets the preset requirement, inputting the digital latent variable into a preset Bayesian model for processing to obtain the likelihood ratio score of the audio information; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, the digital latent variables of each number uttered by a single speaker in a sound sample set; each digital latent variable has an identity tag identifying the speaker to which it belongs; and the Bayesian model has a correspondence to the single speaker in the sound sample set.
The terminal inputs the digital latent variable of each number contained in the audio information into the preset Bayesian model, and calculates by the following formula the probability that the speaker spoke each number contained in the audio information:

P(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), with x_ijk = μ + u_i + v_j + ε_ijk,

where P(x_ijk | u_i, v_j, θ) represents the probability that a person speaks a number, and x_ijk denotes the vector of the j-th number spoken by the i-th person in the k-th session. Since the speaker may be required to speak a plurality of different number strings when performing authentication or entering information, k indexes the session, and θ is a generic term for the parameters of the Bayesian model. The conditional probability P(x_ijk | u_i, v_j, θ) follows a Gaussian distribution with mean μ + u_i + v_j and variance Σ_ε, where Σ_ε denotes the diagonal covariance of ε. The signal component s_ij = μ + u_i + v_j, i.e., the signal component depends only on the speaker and the number; the noise component ε_ijk represents the deviation, or noise, of the j-th number spoken by the i-th person in the k-th session; and μ represents the overall mean of the training vectors.
u_i denotes the latent variable of speaker i; its prior P(u_i) = N(u_i | 0, Σ_u) is a Gaussian over u_i with diagonal covariance Σ_u. v_j denotes the digital latent variable of number j; its prior P(v_j) = N(v_j | 0, Σ_v) is a Gaussian over v_j with diagonal covariance Σ_v.
Formally, N(x | μ, Σ) denotes a Gaussian distribution over x with mean μ and covariance Σ.
After the terminal has calculated the probability corresponding to each number contained in the audio information, it calculates the likelihood ratio score based on the calculated probabilities.
The likelihood ratio score represents the ratio of the probability that speaker i spoke the j-th number to the probability that speaker i did not speak the j-th number. For example, if the probability that speaker i spoke the j-th digit is 0.6, then the probability that speaker i did not speak the j-th digit is 0.4, and the likelihood ratio score is 0.6/0.4 = 1.5.
It can be understood that the terminal may calculate the probability that the speaker spoke each number contained in the audio information according to a preset number of iterations; the preset number of iterations may be 10, or may be set according to actual needs.
The terminal may then calculate the average likelihood ratio score over the multiple iterations in which the speaker spoke the j-th digit, and use that average as the likelihood ratio score for the speaker speaking the j-th digit.
To improve the accuracy of the identity recognition result, the terminal may also, from the likelihood ratio scores of the multiple iterations for the j-th digit, screen out those greater than or equal to a preset likelihood ratio score threshold, calculate the mean of the screened scores, and use the calculated mean as the likelihood ratio score for the speaker speaking the j-th digit (a sketch follows). The preset likelihood ratio score threshold may be 1.2, but is not limited thereto; it may be set according to the actual situation, which is not limited here.
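A sketch of this screened averaging, reusing the example threshold of 1.2; the function names and the probability values are illustrative assumptions:

```python
def likelihood_ratio_score(prob: float) -> float:
    """Ratio of the probability the speaker spoke the digit to the
    probability they did not (assumes 0 <= prob < 1)."""
    return prob / (1.0 - prob)

def averaged_lr_score(probs, lr_threshold=1.2):
    """Keep only per-iteration scores at or above lr_threshold, then average them."""
    kept = [s for s in map(likelihood_ratio_score, probs) if s >= lr_threshold]
    return sum(kept) / len(kept) if kept else 0.0

print(likelihood_ratio_score(0.6))          # 1.5, the worked example above
print(averaged_lr_score([0.6, 0.5, 0.7]))   # 0.5 gives score 1.0 and is screened out
```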
S204: outputting the identity recognition result of the audio information based on the likelihood ratio score.
Because the likelihood ratio score indicates the ratio of the probability that speaker i spoke the j-th number to the probability that speaker i did not, the terminal can determine that the speaker of the audio information is speaker i when the likelihood ratio score for speaker i speaking the j-th digit is greater than 1; when that score is less than or equal to 1, it determines that the speaker of the audio information is not speaker i.
Further, in one embodiment, to allow judgment with a common likelihood ratio score threshold and thereby improve recognition efficiency, S204 may include: normalizing the likelihood ratio score using the formula

s' = (s - μ1) / σ1

and outputting the identity recognition result of the audio information based on the normalized likelihood ratio score, where s' is the normalized likelihood ratio score, s is the likelihood ratio score, μ1 is the approximate mean of the false-speaker score distribution, and σ1 is the standard deviation of the false-speaker score distribution.
By converting likelihood ratio scores from different speakers into a similar range with the above formula, the terminal can make its judgment using a common likelihood ratio score threshold, where μ1 and σ1 are the approximate mean and standard deviation, respectively, of the false-speaker score distribution. The likelihood ratio score can be normalized by the following three normalization methods:
1) Zero normalization (Z-Norm) computes the mean μ1 and standard deviation σ1 from a batch of non-target utterances scored against the target model; that is, it is a normalization scheme that estimates the mean and variance of the impostor score distribution and linearly transforms the score distribution. Z-Norm tests a speaker's probability model with a large amount of impostor speech data to obtain the impostor score distribution; the mean μ1 and standard deviation σ1 are then calculated from that distribution and substituted into the formula above to obtain the normalized likelihood ratio score.
2) Test normalization (T-Norm) uses the feature vectors of the unknown speaker to calculate statistics for a set of impostor speaker models; that is, normalization is performed based on the mean and standard deviation of the impostor score distribution. Unlike Z-Norm, T-Norm uses a large number of impostor speaker models, rather than impostor speech data, to calculate the mean and standard deviation. The normalization is carried out at recognition time: one piece of test speech is scored simultaneously against the claimed speaker model and a number of impostor models to obtain the impostor scores, from which the impostor score distribution and the normalization parameters μ1 and σ1 are calculated.
3) Taking the mean of the normalized values calculated by Z-Norm and T-Norm as the final normalized value gives the normalized likelihood ratio score s'. Z-Norm and T-Norm are prior art and are not described here in detail.
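The three schemes might be sketched as follows; the impostor score lists are placeholders, and in practice they would come from scoring impostor utterances (Z-Norm) or impostor models (T-Norm) as described above:

```python
import statistics

def z_norm(score: float, impostor_scores) -> float:
    """Z-Norm: statistics from impostor *speech* scored on the target model."""
    mu1 = statistics.mean(impostor_scores)
    sigma1 = statistics.stdev(impostor_scores)
    return (score - mu1) / sigma1

def t_norm(score: float, impostor_model_scores) -> float:
    """T-Norm: statistics from scoring the test utterance on impostor *models*."""
    mu1 = statistics.mean(impostor_model_scores)
    sigma1 = statistics.stdev(impostor_model_scores)
    return (score - mu1) / sigma1

def zt_norm(score: float, impostor_scores, impostor_model_scores) -> float:
    """Method 3: the mean of the Z-Norm and T-Norm normalized values."""
    return 0.5 * (z_norm(score, impostor_scores) + t_norm(score, impostor_model_scores))

# Hypothetical numbers purely for illustration.
print(zt_norm(1.5, [0.4, 0.6, 0.5, 0.7], [0.8, 1.0, 0.9, 1.1]))
```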
Further, in another embodiment, in order to improve the reliability and accuracy of the recognition result and reduce the probability of erroneous judgment, S204 may include: determining the identity recognition result of the audio information based on the likelihood ratio score; checking whether the identity recognition result is credible using a likelihood ratio verification method; outputting the identity recognition result when it is verified as credible; and returning to S201, or ending the flow, when it is verified as not credible.
Referring to fig. 3 together, fig. 3 is a schematic diagram of null hypothesis and alternative hypothesis according to an embodiment of the present invention.
Specifically, verification is cast as a hypothesis test: under the null hypothesis H0, the audio vectors have the same speaker latent variable u_i and digital latent variable v_j; the alternative hypothesis is H1. Here i and j are positive integers greater than or equal to 1.
Under the null hypothesis H0, one person corresponds to one spoken number; under the alternative hypothesis H1, one person corresponds to several numbers, or one number corresponds to several persons, or several persons are mixed with several numbers. U1, U2, V1 and V2 represent different speakers.
The terminal can verify the result by comparing the likelihood of the data under the different hypotheses shown in fig. 3. Xt represents which number a person spoke (for example: person i spoke number j); Xs represents which person a number was spoken by (for example: number j was spoken by person i).
ε_t represents the error of Xt, and ε_s represents the error of Xs.
Under the alternative hypothesis H1, the features Xt and Xs do not match.
Under the null hypothesis H0, the two sides are judged consistent, and the features Xt and Xs match.
When the features Xt and Xs match, the identity recognition result of the audio information is determined to be accurate and credible, and the identity recognition result is output.
When the features Xt and Xs do not match, the identity recognition result of the audio information is determined not to be credible and an erroneous judgment may exist; at this point, the method returns to S201, or the speaker identification flow ends.
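The patent does not spell out the likelihood ratio verification computation, so the following is only one plausible sketch: accept the result when the data are more likely under H0 (shared u_i and v_j) than under H1; all names and values are assumptions:

```python
def verify_trusted(loglik_h0: float, loglik_h1: float, margin: float = 0.0) -> bool:
    """Trust the identity result only when the log-likelihood under H0
    (features Xt and Xs match) beats H1 (they do not) by at least `margin`."""
    return (loglik_h0 - loglik_h1) > margin

# Hypothetical log-likelihoods obtained from the Bayesian model above.
print(verify_trusted(loglik_h0=-12.3, loglik_h1=-15.9))  # True: Xt and Xs match
print(verify_trusted(loglik_h0=-18.0, loglik_h1=-15.9))  # False: return to S201
```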
Further, after S204, the method may also include S205: when the identity recognition result is credible and identity verification passes, responding to a voice control instruction of the speaker corresponding to the audio information, and executing the preset operation corresponding to the voice control instruction.
When the identity recognition result is determined to be credible, the terminal judges, based on preset legitimate identity information, whether the speaker corresponding to the identity recognition result is a legitimate user, and determines that verification passes when the speaker is a legitimate user. Then, upon obtaining a voice control instruction input by the speaker, the terminal responds to the voice control instruction, obtains the preset operation corresponding to it, and executes that operation.
For example, if the voice control instruction is a search instruction for an article, the terminal responds to the search instruction by searching a local database or a network database for information related to the article.
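A toy sketch of dispatching a verified speaker's voice control instruction to its preset operation; the instruction names and operations are invented for illustration:

```python
def handle_voice_command(instruction: str, argument: str) -> None:
    """Look up and execute the preset operation for a recognized instruction."""
    operations = {
        # e.g. a search instruction queries a local or network database
        "search": lambda arg: print(f"searching databases for: {arg}"),
    }
    operation = operations.get(instruction, lambda arg: print("unsupported instruction"))
    operation(argument)

handle_voice_command("search", "some article")  # hypothetical search instruction
```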
According to the embodiment of the invention, the loudspeaker latent variable and the digital latent variable of the audio information to be identified are extracted; when the loudspeaker latent variable meets the preset requirement, the digital latent variable is input into the preset Bayesian model for processing to obtain the likelihood ratio score of the audio information, and the identity recognition result of the audio information is output based on the likelihood ratio score. Because the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information, when the loudspeaker latent variable of the audio information meets the preset requirement, interference of loudspeaker performance with the identity recognition result can be excluded. The speaker's identity is then recognized based on the digital latent variable of each number uttered by the person to be tested; because each number can have several digital latent variables, the speaker can be accurately identified even when the same number is pronounced differently at different moments, interference caused by different speakers pronouncing the same number differently is avoided, and the accuracy of the identity recognition result is improved. Outputting the identity recognition result based on the likelihood ratio score of the audio information further reduces the probability of erroneous judgment and improves the accuracy of the recognition result.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Referring to fig. 4, fig. 4 is a schematic diagram of a terminal according to an embodiment of the invention. Each unit included in the terminal is used for executing the steps in the embodiments corresponding to fig. 1 and fig. 2; for details, refer to the descriptions of those embodiments. For convenience of explanation, only the portions related to this embodiment are shown. Referring to fig. 4, the terminal 4 includes:
an obtaining unit 410, configured to obtain the audio information to be identified spoken by the person to be tested for the reference digital string; the reference digital string is pre-stored and randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the person to be tested;
an extracting unit 420, configured to extract a loudspeaker latent variable and a digital latent variable of the audio information; the loudspeaker latent variable is used for identifying feature information of the loudspeaker, the digital latent variable is extracted when it is confirmed that the number string spoken by the person to be tested is identical to the reference number string, and the digital latent variable is used for identifying the pronunciation features of the person to be tested for the numbers in the audio information;
the recognition unit 430 is configured to input the digital latent variable into a preset Bayesian model for voiceprint recognition when the loudspeaker latent variable meets the preset requirement, so as to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, the digital latent variables of each number uttered by a single speaker in a sound sample set; each digital latent variable has an identity tag identifying the speaker to which it belongs; and the Bayesian model has a correspondence to the single speaker in the sound sample set.
Further, the extraction unit 420 includes:
a first detection unit for extracting a loudspeaker latent variable from the audio information and detecting whether the audio information is qualified based on the value of the loudspeaker latent variable;
the second detection unit is used for detecting whether the number string spoken by the to-be-detected person is identical with the reference number string or not based on the reference number string and the audio corresponding to the number string spoken by the to-be-detected person when the audio information is qualified;
And the latent variable extraction unit is used for extracting the digital latent variable from the audio information when the detection results are the same.
Further, the recognition unit 430 includes:
the computing unit is used for inputting the digital latent variable into a preset Bayesian model for processing to obtain likelihood ratio scores of the audio information;
and the identity recognition unit is used for outputting an identity recognition result of the audio information based on the likelihood ratio score.
Further, the identity recognition unit is specifically configured to: normalize the likelihood ratio score using the formula

$$ s' = \frac{s - \mu_{\mathrm{false}}}{\sigma_{\mathrm{false}}} $$

and output an identity recognition result of the audio information based on the normalized likelihood ratio score; wherein \( s' \) is the normalized likelihood ratio score, \( s \) is the likelihood ratio score, \( \mu_{\mathrm{false}} \) is the approximate mean of the score distribution of false speakers, and \( \sigma_{\mathrm{false}} \) is the standard deviation of the score distribution of false speakers.
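In code, this normalization is a single expression. Estimating mu_false and sigma_false from a cohort of false-speaker scores, as below, is an assumed procedure in the style of Z-norm score normalization; the embodiment itself does not say how the two statistics are obtained.

```python
def normalize_score(s: float, mu_false: float, sigma_false: float) -> float:
    """s' = (s - mu_false) / sigma_false, as in the formula above."""
    return (s - mu_false) / sigma_false

# Assumed cohort of false-speaker scores used to estimate the two statistics.
false_scores = [-4.2, -3.8, -5.1, -4.6]
mu = sum(false_scores) / len(false_scores)
sigma = (sum((x - mu) ** 2 for x in false_scores) / len(false_scores)) ** 0.5
print(normalize_score(1.7, mu, sigma))      # far above 0: unlike a false speaker
```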
Further, the identity recognition unit is specifically configured to: determine an identity recognition result of the audio information based on the likelihood ratio score; check whether the identity recognition result is credible by a likelihood ratio verification method, and output the identity recognition result when it is verified to be credible; and when the identity recognition result is not credible, end the process or cause the obtaining unit 410 to re-execute the step of obtaining the audio information to be identified, which is spoken by the person to be tested for the reference digital string.
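The accept / retry / end control flow described here might look like the following sketch. The two decision thresholds and the retry limit are illustrative assumptions, as the embodiment leaves the credibility test of the likelihood ratio verification unspecified.

```python
import random

ACCEPT_THRESHOLD = 2.0    # assumed: normalized scores above this are credibly genuine
REJECT_THRESHOLD = -2.0   # assumed: scores below this are credibly false
MAX_ATTEMPTS = 3          # assumed retry limit before the process ends

def run_identification(acquire_audio, score_audio) -> str:
    """Re-acquire audio until the decision is credible or the attempts run out."""
    for _ in range(MAX_ATTEMPTS):
        audio = acquire_audio()             # the obtaining unit runs again on retry
        s = score_audio(audio)              # normalized likelihood ratio score
        if s >= ACCEPT_THRESHOLD:
            return "accepted"
        if s <= REJECT_THRESHOLD:
            return "rejected"
        # Score falls in the ambiguous band: the result is not credible, retry.
    return "undetermined"

# Toy usage: stub callables stand in for the real acquisition and scoring units.
print(run_identification(lambda: None, lambda a: random.uniform(-4.0, 4.0)))
```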
Fig. 5 is a schematic diagram of a terminal according to another embodiment of the present invention. As shown in fig. 5, the terminal 5 of this embodiment includes: a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps in the above-described embodiments of the method for identifying a speaker, such as S101 to S103 shown in fig. 1. Alternatively, when executing the computer program 52, the processor 50 implements the functions of the units in the above-described device embodiments, for example, the functions of the units 410 to 430 shown in fig. 4.
By way of example, the computer program 52 may be partitioned into one or more units that are stored in the memory 51 and executed by the processor 50 to complete the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 52 in the terminal 5. For example, the computer program 52 may be divided into an obtaining unit, an extracting unit, and a recognition unit, each unit functioning as described above.
The terminal may include, but is not limited to, the processor 50 and the memory 51. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the terminal 5 and does not constitute a limitation on the terminal 5; the terminal may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal may further include an input/output device, a network access device, a bus, and the like.
The processor 50 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or any conventional processor.
The memory 51 may be an internal storage unit of the terminal 5, such as a hard disk or memory of the terminal 5. The memory 51 may also be an external storage device of the terminal 5, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the terminal 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal 5. The memory 51 is used to store the computer program as well as other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method of identifying a speaker, comprising:
acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string; the reference digital string is pre-stored and randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the person to be tested;
extracting a speaker latent variable and a digital latent variable of the audio information; the speaker latent variable is used for identifying the feature information of the speaker; the digital latent variable is extracted when it is confirmed that the digital string spoken by the person to be tested is identical to the reference digital string, and is used for identifying the pronunciation features of the person to be tested for the digits in the audio information;
when the speaker latent variable meets a preset requirement, inputting the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the speaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digital latent variables of each digit uttered by a single speaker in a sound sample set, and each digital latent variable has an identity tag identifying the speaker to which the digital latent variable belongs; and the Bayesian model has a correspondence with the single speaker in the sound sample set.
2. The method of claim 1, wherein the extracting of the speaker latent variable and the digital latent variable of the audio information comprises:
extracting a speaker latent variable from the audio information, and detecting whether the audio information is qualified or not based on the value of the speaker latent variable;
when the audio information is qualified, detecting, based on the reference digital string and the audio corresponding to the digital string spoken by the person to be tested, whether the digital string spoken by the person to be tested is identical to the reference digital string;
and when the detection result is that the two strings are identical, extracting the digital latent variable from the audio information.
3. The method according to claim 1 or 2, wherein when the speaker latent variable meets a preset requirement, the inputting of the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result comprises:
inputting the digital latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
and outputting an identity recognition result of the audio information based on the likelihood ratio score.
4. The method according to claim 3, wherein the outputting of the identity recognition result of the audio information based on the likelihood ratio score comprises:
normalizing the likelihood ratio score using the formula

$$ s' = \frac{s - \mu_{\mathrm{false}}}{\sigma_{\mathrm{false}}} $$

and outputting an identity recognition result of the audio information based on the normalized likelihood ratio score; wherein \( s' \) is the normalized likelihood ratio score, \( s \) is the likelihood ratio score, \( \mu_{\mathrm{false}} \) is the approximate mean of the score distribution of false speakers, and \( \sigma_{\mathrm{false}} \) is the standard deviation of the score distribution of false speakers.
5. The method according to claim 3, wherein the outputting of the identity recognition result of the audio information based on the likelihood ratio score comprises:
Determining an identity recognition result of the audio information based on the likelihood ratio score;
checking whether the identity recognition result is credible by a likelihood ratio verification method, and outputting the identity recognition result when it is verified to be credible; and when the identity recognition result is verified to be not credible, returning to the step of acquiring the audio information to be identified, which is spoken by the person to be tested for the reference digital string, or ending the process.
6. A terminal, comprising:
an acquisition unit, used for acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string; the reference digital string is pre-stored and randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the person to be tested;
an extracting unit, used for extracting a speaker latent variable and a digital latent variable of the audio information; the speaker latent variable is used for identifying the feature information of the speaker; the digital latent variable is extracted when it is confirmed that the digital string spoken by the person to be tested is identical to the reference digital string, and is used for identifying the pronunciation features of the person to be tested for the digits in the audio information;
and an identification unit, used for inputting the digital latent variable into a preset Bayesian model for voiceprint recognition when the speaker latent variable meets a preset requirement, so as to obtain an identity recognition result; wherein the preset requirement is set based on the value of the speaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digital latent variables of each digit uttered by a single speaker in a sound sample set; each digital latent variable has an identity tag identifying the speaker to which the digital latent variable corresponds; and the Bayesian model has a correspondence with the single speaker in the sound sample set.
7. A terminal comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of when executing the computer program:
acquiring audio information to be identified, which is spoken by a person to be tested for a reference digital string; the reference digital string is pre-stored and randomly played or randomly displayed, and the audio information comprises audio corresponding to the digital string spoken by the person to be tested;
extracting a speaker latent variable and a digital latent variable of the audio information; the speaker latent variable is used for identifying the feature information of the speaker; the digital latent variable is extracted when it is confirmed that the digital string spoken by the person to be tested is identical to the reference digital string, and is used for identifying the pronunciation features of the person to be tested for the digits in the audio information;
when the speaker latent variable meets a preset requirement, inputting the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the speaker latent variable corresponding to clearly identifiable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digital latent variables of each digit uttered by a single speaker in a sound sample set; each digital latent variable has an identity tag identifying the speaker to which it belongs; and the Bayesian model has a correspondence with the single speaker in the sound sample set.
8. The terminal of claim 7, wherein the extracting of the speaker latent variable and the digital latent variable of the audio information comprises:
Extracting a speaker latent variable from the audio information, and detecting whether the audio information is qualified or not based on the value of the speaker latent variable;
when the audio information is qualified, detecting, based on the reference digital string and the audio corresponding to the digital string spoken by the person to be tested, whether the digital string spoken by the person to be tested is identical to the reference digital string;
and when the detection result is that the two strings are identical, extracting the digital latent variable from the audio information.
9. The terminal of claim 8, wherein when the speaker latent variable meets a preset requirement, the inputting of the digital latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result comprises:
inputting the digital latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
and outputting an identity recognition result of the audio information based on the likelihood ratio score.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 5.
CN201910354414.7A 2019-04-29 2019-04-29 Method, terminal and computer readable storage medium for identifying speaker Active CN110111798B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910354414.7A CN110111798B (en) 2019-04-29 2019-04-29 Method, terminal and computer readable storage medium for identifying speaker
PCT/CN2019/103299 WO2020220541A1 (en) 2019-04-29 2019-08-29 Speaker recognition method and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910354414.7A CN110111798B (en) 2019-04-29 2019-04-29 Method, terminal and computer readable storage medium for identifying speaker

Publications (2)

Publication Number Publication Date
CN110111798A CN110111798A (en) 2019-08-09
CN110111798B true CN110111798B (en) 2023-05-05

Family

ID=67487460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910354414.7A Active CN110111798B (en) 2019-04-29 2019-04-29 Method, terminal and computer readable storage medium for identifying speaker

Country Status (2)

Country Link
CN (1) CN110111798B (en)
WO (1) WO2020220541A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111798B (en) * 2019-04-29 2023-05-05 平安科技(深圳)有限公司 Method, terminal and computer readable storage medium for identifying speaker
CN110503956B (en) * 2019-09-17 2023-05-12 平安科技(深圳)有限公司 Voice recognition method, device, medium and electronic equipment
CN111768789B (en) * 2020-08-03 2024-02-23 上海依图信息技术有限公司 Electronic equipment, and method, device and medium for determining identity of voice generator of electronic equipment
CN112820297A (en) * 2020-12-30 2021-05-18 平安普惠企业管理有限公司 Voiceprint recognition method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448685A (en) * 2016-10-09 2017-02-22 北京远鉴科技有限公司 System and method for identifying voice prints based on phoneme information
CN106531171A (en) * 2016-10-13 2017-03-22 普强信息技术(北京)有限公司 Method for realizing dynamic voiceprint password system
CN107104803A (en) * 2017-03-31 2017-08-29 清华大学 It is a kind of to combine the user ID authentication method confirmed with vocal print based on numerical password
CN109256138A (en) * 2018-08-13 2019-01-22 平安科技(深圳)有限公司 Auth method, terminal device and computer readable storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5982297B2 (en) * 2013-02-18 2016-08-31 日本電信電話株式会社 Speech recognition device, acoustic model learning device, method and program thereof
US9318112B2 (en) * 2014-02-14 2016-04-19 Google Inc. Recognizing speech in the presence of additional audio
JPWO2017168870A1 (en) * 2016-03-28 2019-02-07 ソニー株式会社 Information processing apparatus and information processing method
JP2018013722A (en) * 2016-07-22 2018-01-25 国立研究開発法人情報通信研究機構 Acoustic model optimization device and computer program therefor
KR101843074B1 (en) * 2016-10-07 2018-03-28 서울대학교산학협력단 Speaker recognition feature extraction method and system using variational auto encoder
US9911413B1 (en) * 2016-12-28 2018-03-06 Amazon Technologies, Inc. Neural latent variable model for spoken language understanding
CN109166586B (en) * 2018-08-02 2023-07-07 平安科技(深圳)有限公司 Speaker identification method and terminal
CN110111798B (en) * 2019-04-29 2023-05-05 平安科技(深圳)有限公司 Method, terminal and computer readable storage medium for identifying speaker

Also Published As

Publication number Publication date
CN110111798A (en) 2019-08-09
WO2020220541A1 (en) 2020-11-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant