CN110111798A - A kind of method and terminal identifying speaker - Google Patents
A kind of method and terminal identifying speaker
- Publication number
- CN110111798A (application CN201910354414.7A)
- Authority
- CN
- China
- Prior art keywords
- latent variable
- audio-frequency information
- loudspeaker
- numeric string
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/12—Score normalisation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The present invention, which is applicable to the field of computer technology, provides a method and terminal for identifying a speaker. The method comprises: obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string, the audio information including the numeric string; extracting the loudspeaker latent variable and the digit latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information; and, when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result. By identifying the speaker's identity from the loudspeaker latent variable and the digit latent variables in the audio information, the embodiments of the present invention avoid interference with the recognition result caused by different loudspeakers rendering the same digit differently, or by the same speaker pronouncing the same digit differently at different times, thereby improving the accuracy of the identity recognition result.
Description
Technical field
The invention belongs to the field of computer technology, and more particularly relates to a method and terminal for identifying a speaker.
Background technique
With the rapid development of information and network technology, people's demand for identity recognition technology keeps growing. Identity recognition technology based on conventional password authentication has exposed many shortcomings in practical applications (for example, relatively low security and reliability), while identity recognition based on biometric features has matured in recent years and shown its superiority in practice. Voiceprint recognition is one such biometric identity recognition technology.
A voiceprint refers to the information pattern of a speaker's voice spectrum. Since every person's vocal organs differ, the sounds they produce and their tones differ as well; identification using the voiceprint as the essential feature is therefore stable and hard to replace.
Voiceprint recognition comes in two kinds: text-dependent (Text-Dependent) and text-independent (Text-Independent). In contrast to speaker identification on free, text-independent speech content, text-dependent speaker verification systems are better suited to security applications, because they tend to show higher accuracy on short sessions.
Typical text-dependent speaker identification has each user utter a fixed phrase to match the enrollment and test utterances. In that case, the system can be fooled by playing back pre-recorded speech from the user. When training and test utterances come from different scenarios, sharing the same speech content can improve the security of the identification to a certain extent: the system randomly provides a numeric string, and the user must read out the corresponding content correctly before the voiceprint is recognized. This randomness gives each collected voiceprint a content-level difference over time in classic text-dependent recognition. However, when recognizing a speaker with randomly prompted connected digits, the digit vocabulary is fixed and digit samples are limited, and different loudspeakers render the same digit differently, so the speaker's identity cannot be identified accurately.
Summary of the invention
In view of this, embodiments of the present invention provide a method and terminal for identifying a speaker, to solve the problem in the prior art that, for complicated voiceprint information (for example, short speech or imitated speech), a text-independent voiceprint recognition system cannot accurately extract the speaker's speech features and therefore cannot accurately identify the speaker's identity.
A first aspect of the embodiments of the present invention provides a method for identifying a speaker, comprising:

obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says;

extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information;

when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a sound sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
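The flow above (obtain audio, check the loudspeaker latent variable, verify the spoken string, then score the digit latent variables) can be sketched as one orchestration function. Every helper below is an injected stub assumed for illustration, not an API named by the patent:

```python
def recognize_speaker(audio, benchmark_digits, *, get_snr, snr_threshold,
                      transcribe_digits, extract_digit_latents,
                      score_with_bayes_model, score_threshold):
    """Sketch of the claimed flow; all helpers are injected stubs."""
    if get_snr(audio) < snr_threshold:
        return False  # loudspeaker latent variable fails the preset requirement
    if transcribe_digits(audio) != benchmark_digits:
        return False  # spoken string must match the benchmark numeric string
    latents = extract_digit_latents(audio)  # one latent variable per digit
    return score_with_bayes_model(latents) >= score_threshold

# Toy stubs standing in for the real components:
ok = recognize_speaker(
    b"fake-audio-bytes", "583920",
    get_snr=lambda a: 75.0, snr_threshold=70.0,
    transcribe_digits=lambda a: "583920",
    extract_digit_latents=lambda a: [[0.1] * 8 for _ in "583920"],
    score_with_bayes_model=lambda latents: 8,
    score_threshold=7,
)
```

With these stubs the SNR check, string match, and score threshold all pass, so `ok` is true.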
A second aspect of the embodiments of the present invention provides a terminal, comprising:

an acquiring unit, configured to obtain to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says;

an extraction unit, configured to extract the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information;

a recognition unit, configured to, when the loudspeaker latent variable meets a preset requirement, input the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a sound sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
A third aspect of the embodiments of the present invention provides a terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says;

extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information;

when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a sound sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:

obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says;

extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information;

when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a sound sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
Implementing the method and terminal for identifying a speaker provided by the embodiments of the present invention has the following beneficial effects:

The embodiments of the present invention extract the loudspeaker latent variable and digit latent variables of the to-be-identified audio information; when the loudspeaker latent variable meets the requirement, the digit latent variables are input into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result. Since the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable of the audio information meets the preset requirement, interference of the loudspeaker's own rendering of digit pronunciation with the recognition result can be excluded. The speaker's identity is then identified from the digit latent variable of each digit the person under test says. Because each digit can have multiple digit latent variables, the speaker's identity can still be identified accurately even if the speaker pronounces the same digit differently at different times. This avoids cases where different loudspeakers render the same digit differently, or where the speaker pronounces the same digit differently at different times, interfering with the identity recognition result, and thus improves the accuracy of the identity recognition result.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the embodiments or the description of the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

Fig. 1 is an implementation flowchart of a method for identifying a speaker provided by an embodiment of the present invention;

Fig. 2 is an implementation flowchart of a method for identifying a speaker provided by another embodiment of the present invention;

Fig. 3 is a schematic diagram of the null hypothesis and the alternative hypothesis provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of a terminal provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of a terminal provided by another embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 1, Fig. 1 is an implementation flowchart of a method for identifying a speaker provided by an embodiment of the present invention. The execution subject of the method in this embodiment is a terminal. The terminal includes, but is not limited to, mobile terminals such as smartphones, tablet computers, and wearable devices, and may also be a desktop computer or the like. As shown in the figure, the method for identifying a speaker may include:
S101: obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says.
When detecting a speaker recognition instruction, the terminal can obtain, through a built-in sound pickup device (for example, a microphone), the to-be-identified audio information uttered by the speaker in the surrounding environment; in this case, the audio information is uttered by the speaker according to the benchmark numeric string that the terminal provides at random. Alternatively, the terminal obtains, according to a file identifier included in the speaker recognition instruction, the audio file or video file corresponding to that identifier, extracts the audio information from the file, and uses it as the to-be-identified audio information. The audio or video file contains audio information obtained when the person under test read out the benchmark numeric string; it may be uploaded by the user, or downloaded from a server for storing audio or video files. The benchmark numeric string is stored in the terminal in advance and is broadcast or displayed at random by the terminal. There may be multiple benchmark numeric strings.
The to-be-identified audio information includes the audio corresponding to the numeric string, and the numeric string consists of at least one digit.
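As a minimal sketch of how the terminal might generate the random benchmark numeric string it prompts (the function name and the use of the `secrets` module are illustrative assumptions, not from the patent):

```python
import secrets

def generate_benchmark_digit_string(length: int = 6) -> str:
    """Generate a random benchmark numeric string to prompt the person under test."""
    return "".join(secrets.choice("0123456789") for _ in range(length))

prompt = generate_benchmark_digit_string(6)  # e.g. a 6-digit string such as "583920"
```

Using `secrets` rather than `random` makes the prompt unpredictable, which matters because the randomness of the string is what defeats replayed recordings.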
S102: extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information.
Based on the obtained audio information, the terminal computes the loudspeaker latent variable of the audio information. The loudspeaker latent variable includes, but is not limited to, the signal-to-noise ratio, and may also include the efficiency, the sound pressure level, and so on, of the loudspeaker.
The signal-to-noise ratio (SNR) is a parameter describing the proportion of effective signal to noise in a signal; the higher the loudspeaker's signal-to-noise ratio, the clearer the picked-up sound. For example, the terminal extracts the normal speech signal, and the noise signal present when there is no speech, from the audio information, and computes the signal-to-noise ratio of the audio information based on the two.
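That computation can be sketched as follows, assuming a speech segment and a silence-only segment have already been separated (the segmentation step itself is not shown, and the signals here are synthetic):

```python
import numpy as np

def estimate_snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Estimate SNR in dB from a speech segment and a silence (noise-only) segment."""
    signal_power = np.mean(speech.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(16000)             # 1 s of low-level noise at 16 kHz
tone = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
speech = tone + 0.01 * rng.standard_normal(16000)     # "speech" = tone + same noise floor
snr = estimate_snr_db(speech, noise)                  # roughly 37 dB for these signals
```

A clean tone over a faint noise floor yields an SNR of a few tens of dB, which is the regime where a threshold like the "greater than or equal to 70" example later in the text would simply be tuned to the chosen units and pickup hardware.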
The terminal can input the obtained audio information into a pre-trained deep neural network (Deep Neural Networks, DNN) model and extract the digit latent variable of each digit in the audio information through the deep neural network. A digit latent variable identifies a pronunciation characteristic of a given digit; the same digit can have at least two different digit latent variables, that is, at least two different pronunciations. For example, the digit "1" can be pronounced "yi", "yao", and so on.

In this embodiment, the deep neural network model includes an input layer, hidden layers, and an output layer. The input layer includes one input node and receives the audio information input from outside. The hidden layers include more than two hidden-layer nodes, process the audio information, and extract its digit latent variables. The output layer outputs the processing result, namely the digit latent variables of the audio information.
The deep neural network model is trained on a sound sample set, which includes sound samples of each digit spoken by speakers. There can be many speakers, for example 500 or 1500; to a certain extent, the more training samples there are, the more accurate the results when recognizing with the trained neural network model. The sound samples contain the preset digits; each sound sample has a labeled digit latent variable, and the digit latent variables correspond one-to-one with the sample labels. A sound sample may contain only one digit, and each digit in the sound sample set may correspond to at least two digit latent variables. The digits covered by the sound samples include every digit that any randomly generated benchmark numeric string may contain. For example, when the randomly generated benchmark numeric string contains any 6 of the 10 digits 0 to 9, the sound sample set includes sound samples of all 10 digits 0 to 9.
During training, the input data of the deep neural network model is audio information; this input can be a vector matrix obtained from the audio information, composed of the vectors corresponding to the digit-bearing audio data extracted in sequence from the audio information. The output data of the deep neural network model is the digit latent variable of each digit in the audio information.

It can be understood that each speaker corresponds to one neural network recognition model; when multiple speakers need to be recognized, the neural network recognition models corresponding to the multiple speakers are trained.
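As an illustrative sketch only (the layer sizes, ReLU activations, random weights, and mean-pooling over a digit's frames are all assumptions; the patent specifies only input layer, hidden layers, and output layer), a feed-forward extractor mapping per-frame features to one digit latent variable could look like:

```python
import numpy as np

class DigitLatentExtractor:
    """Toy feed-forward network: per-frame features in, one digit latent variable out."""

    def __init__(self, feat_dim=40, hidden_dim=256, latent_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained random weights; a real model would be trained on the sample set.
        self.w1 = 0.01 * rng.standard_normal((feat_dim, hidden_dim))
        self.w2 = 0.01 * rng.standard_normal((hidden_dim, hidden_dim))
        self.w3 = 0.01 * rng.standard_normal((hidden_dim, latent_dim))

    def extract(self, frames: np.ndarray) -> np.ndarray:
        h = np.maximum(frames @ self.w1, 0.0)   # hidden layer 1 (ReLU)
        h = np.maximum(h @ self.w2, 0.0)        # hidden layer 2 (ReLU)
        per_frame = h @ self.w3                 # per-frame latent output
        return per_frame.mean(axis=0)           # pool over the digit's frames

extractor = DigitLatentExtractor()
frames = np.random.default_rng(1).standard_normal((50, 40))  # 50 frames of one digit
z = extractor.extract(frames)                                # 64-dim digit latent variable
```

The pooling step reflects the idea that one utterance of one digit, however many frames long, yields a single digit latent variable to feed the Bayesian model.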
When extracting the loudspeaker latent variable, the terminal judges, based on a preset condition, whether the extracted loudspeaker latent variable meets the preset requirement. The preset condition is used to judge whether the sound in the audio information can be clearly recognized, and can be set based on the value of the loudspeaker latent variable corresponding to a clearly recognizable sound. For example, when the loudspeaker latent variable is the signal-to-noise ratio, the preset condition can be that the signal-to-noise ratio is greater than or equal to a preset SNR threshold, the threshold being the signal-to-noise ratio corresponding to clearly recognizable sound.

When the loudspeaker latent variable meets the requirement, S103 is executed; when it does not, the flow can return to S101, or an identity recognition result can be output indicating that the current speaker does not match the speaker corresponding to the benchmark numeric string.
S103: when the loudspeaker latent variable meets the preset requirement, inputting the digit latent variables into the preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in the sound sample set; each digit latent variable carries an identity label identifying the speaker to whom it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.

For example, when a speaker says a certain digit and produces a given digit latent variable, the identity label of that digit latent variable indicates the identity of the speaker who said the digit.
In this embodiment, the expression of the preset Bayesian model is as follows:

p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), where p(u_i) = N(u_i | 0, Σ_u) and p(v_j) = N(v_j | 0, Σ_v).

Here p(x_ijk | u_i, v_j, θ) denotes the probability that a person said a digit, and x_ijk denotes the i-th person saying the j-th digit in the k-th session. Since a speaker may be required to say several different numeric strings during identity verification or enrollment, k indexes the session, and θ collectively denotes the parameters of the Bayesian model. The conditional probability N(x_ijk | μ + u_i + v_j, Σ_ε) denotes a Gaussian distribution with mean μ + u_i + v_j and covariance Σ_ε, where Σ_ε is the diagonal covariance of ε. The signal component is x_ijk = μ + u_i + v_j + ε_ijk, that is, the signal component depends on the speaker and the digit; the noise component ε_ijk denotes the deviation or noise when the i-th person says the j-th digit in the k-th session, and μ denotes the overall mean of the training vectors.

u_i denotes the latent variable of speaker i and is modeled as a Gaussian with diagonal covariance Σ_u, so p(u_i) = N(u_i | 0, Σ_u); v_j denotes the latent variable of digit j and is modeled as a Gaussian with diagonal covariance Σ_v, so p(v_j) = N(v_j | 0, Σ_v).

In form, the Bayesian model can be described with the conditional probability N(x | μ, Σ), which denotes a Gaussian over x with mean μ and covariance Σ.
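A minimal sketch of evaluating that likelihood with diagonal covariances (the dimension and the latent values below are illustrative; learning μ, Σ_u, Σ_v, Σ_ε from data is not shown):

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """log N(x | mean, diag(var)) for a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def model_log_likelihood(x_ijk, mu, u_i, v_j, var_eps):
    """log p(x_ijk | u_i, v_j) = log N(x_ijk | mu + u_i + v_j, Sigma_eps)."""
    return log_gaussian_diag(x_ijk, mu + u_i + v_j, var_eps)

d = 4                          # latent dimension (illustrative)
mu = np.zeros(d)               # overall mean of the training vectors
u_i = np.full(d, 0.1)          # latent variable of speaker i
v_j = np.full(d, 0.2)          # latent variable of digit j
var_eps = np.full(d, 0.5)      # diagonal of Sigma_eps
x = mu + u_i + v_j             # an observation sitting exactly at the model mean
ll = model_log_likelihood(x, mu, u_i, v_j, var_eps)
```

At the mean, the quadratic term vanishes and the log-likelihood reduces to the normalization constant, which makes the identity x_ijk = μ + u_i + v_j + ε_ijk easy to check term by term.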
After the terminal finishes iteratively computing, for each digit included in the audio information, the probability that the speaker said it, the terminal performs identity recognition on the speaker based on the probabilities obtained in each iteration, obtaining the identity recognition result.

For example, with a total of 10 iterations: when the probability that speaker i said a digit is greater than or equal to a preset probability threshold (for example, 0.8), 1 point is scored; when it is less than the threshold, 0 points are scored. The total score of speaker i is counted after the 10 iterations, and when the total score is greater than or equal to a preset score threshold (for example, 7 points), the audio information is determined to come from speaker i.
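The scoring rule in that example can be sketched directly (the thresholds 0.8 and 7 are taken from the text; the probabilities below are made up for illustration):

```python
def identify_speaker(per_iteration_probs, prob_threshold=0.8, score_threshold=7):
    """Score 1 point per iteration whose probability meets prob_threshold;
    accept the claimed speaker when the total reaches score_threshold."""
    total = sum(1 for p in per_iteration_probs if p >= prob_threshold)
    return total >= score_threshold, total

probs = [0.9, 0.85, 0.95, 0.7, 0.92, 0.88, 0.81, 0.9, 0.6, 0.83]  # 10 iterations
accepted, score = identify_speaker(probs)  # 8 of 10 meet 0.8, so the speaker is accepted
```

Counting threshold crossings rather than averaging probabilities makes the decision robust to a few badly pronounced digits, which matches the patent's goal of tolerating pronunciation variation across sessions.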
The embodiments of the present invention extract the loudspeaker latent variable and digit latent variables of the to-be-identified audio information; when the loudspeaker latent variable meets the requirement, the digit latent variables are input into the preset Bayesian model for voiceprint recognition to obtain an identity recognition result. Since the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable of the audio information meets the preset requirement, interference with the recognition result caused by differences in how the loudspeaker itself renders digit pronunciation can be excluded. The speaker's identity is then identified from the digit latent variable of each digit the person under test says; because each digit can have multiple digit latent variables, the speaker's identity can still be identified accurately even if the speaker pronounces the same digit differently at different times. This avoids cases where different loudspeakers render the same digit differently, or where the speaker pronounces the same digit differently at different times, interfering with the identity recognition result, and improves the accuracy of the identity recognition result.
Refer to Fig. 2; Fig. 2 is an implementation flowchart of a method for identifying a speaker provided by another embodiment of the present invention. The execution subject of the method in this embodiment is a terminal. The terminal includes, but is not limited to, mobile terminals such as smartphones, tablet computers, and wearable devices, and may also be a desktop computer or the like. The method for identifying a speaker of this embodiment includes the following steps:
S201: obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says.

S201 in this embodiment is identical to S101 in the previous embodiment; for details, refer to the related description of S101 in the previous embodiment, which is not repeated here.
S202: extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information.

S202 in this embodiment is identical to S102 in the previous embodiment; for details, refer to the related description of S102 in the previous embodiment, which is not repeated here.
Further, S202 may include S2021 to S2023, as follows:

S2021: extracting the loudspeaker latent variable from the audio information, and detecting, based on the value of the loudspeaker latent variable, whether the audio information is qualified.
Specifically, the terminal can extract the loudspeaker latent variable from the to-be-identified audio information and, based on the value of the extracted loudspeaker latent variable and a preset loudspeaker latent variable threshold, detect whether the audio information is qualified, so as to confirm whether the sound in the audio information can be clearly recognized. The loudspeaker latent variable is used to identify characteristic information of the loudspeaker; it includes, but is not limited to, the signal-to-noise ratio, and may also include the efficiency, the sound pressure level, and so on, of the loudspeaker. The preset loudspeaker latent variable threshold is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information.
The signal-to-noise ratio is a parameter describing the proportion of the effective component to the noise component in a signal: the higher the loudspeaker's signal-to-noise ratio, the clearer the sound it picks up. For example, the terminal extracts the normal voice signal and the no-signal noise from the audio information, and computes the signal-to-noise ratio of the audio information from the two.
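The SNR check described above can be sketched as follows; the function names, the dB formula and the way the speech and noise segments are obtained are our own illustration, not taken from the patent:

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Estimate the signal-to-noise ratio in dB from a speech segment
    and a noise-only segment extracted from the same recording."""
    p_signal = np.mean(speech ** 2)   # average speech power
    p_noise = np.mean(noise ** 2)     # average noise power
    return 10.0 * np.log10(p_signal / p_noise)

def audio_qualified(speech: np.ndarray, noise: np.ndarray,
                    threshold: float = 70.0) -> bool:
    # The embodiment treats the recording as qualified when the
    # SNR value is greater than or equal to 70.
    return snr_db(speech, noise) >= threshold
```

Whether 70 is in dB, and how the noise segment is isolated, are left open by the text; the threshold is therefore exposed as a parameter.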
For example, when the loudspeaker latent variable is the signal-to-noise ratio and the terminal detects that its value is greater than or equal to 70, it determines that the audio information to be identified is qualified and that the digits in it can be clearly recognized.
When the detection result is that the audio information is unqualified, the speaker is prompted to read the random reference numeric string again so that the audio information can be reacquired; alternatively, audio information to be identified is reacquired from a database storing audio data. The reference numeric string is a numeric string that the terminal generates randomly, or fetches randomly from a database, during identification of the speaker's identity, and prompts to the user; before S101 the terminal may play it aloud (with standard pronunciation, when voice broadcast is used) or display it. The reference numeric string contains a preset number of digits, for example 5 or 6.
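A minimal sketch of generating such a random reference numeric string of preset length; the function name and the choice to allow repeated digits are our assumptions:

```python
import random

def make_reference_string(length: int = 6) -> str:
    """Draw a random reference numeric string of the preset length
    (5 or 6 in the embodiment), with digits taken from 0-9."""
    return "".join(random.choice("0123456789") for _ in range(length))
```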
When the detection result is that the audio information is qualified, S2022 is executed.
S2022: when the audio information is qualified, detect whether the numeric string said by the person under test is identical to the reference numeric string, based on the reference numeric string and the audio corresponding to the numeric string said by the person under test.
When the detection result is that the audio information is qualified, the terminal can successively extract the speech segments containing digits from the audio information, identify the numeric string contained in the audio information using speech recognition, and compare the recognized numeric string with the reference numeric string to judge whether the two are identical.
Alternatively, the terminal can play the reference numeric string to obtain its corresponding audio, and compare that audio with the audio corresponding to the numeric string said by the person under test. By comparing the audio obtained by playing the reference numeric string with the collected audio of the person under test, deviations in digit pronunciation caused by the performance of the loudspeaker itself, both when picking up audio and when playing it, can be reduced.
When any digit in the recognized numeric string differs from the corresponding digit in the reference numeric string, or the ordering of the digits differs, the numeric string contained in the audio information is determined to be different from the reference numeric string. At this point the flow may return to S201 to reacquire the voice data to be identified, or an identification result may be output stating that the current speaker does not match the speaker corresponding to the reference numeric string. When every digit in the reference numeric string is identical to the corresponding digit in the recognized numeric string, and the ordering of the digits is also identical, the numeric string contained in the audio information is determined to be identical to the reference numeric string, and S2023 is executed.
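The two-part match condition above (same digit values, same ordering) can be sketched as a small comparison helper; the function name and the returned reason strings are illustrative:

```python
def compare_digit_strings(recognized: str, reference: str):
    """Return (matches, reason). A mismatch occurs when any digit
    value differs, or when the digits appear in a different order."""
    if sorted(recognized) != sorted(reference):
        return False, "digit values differ"
    if recognized != reference:
        return False, "digit ordering differs"
    return True, "identical"
```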
S2023: when the detection result is that the strings are identical, extract the digit latent variable from the audio information.
When the numeric string contained in the audio information is confirmed to be identical to the reference numeric string, the terminal converts the audio information into a matrix of vectors, inputs it into the DNN model for processing, and extracts from the audio information the digit latent variable of each digit in the numeric string. The digit latent variable identifies the pronunciation features of the person under test for the digits in the audio information.
Specifically, the terminal can input the acquired audio information into a pre-trained DNN model and extract, through the deep neural network, the digit latent variable of each digit in the audio information. A digit latent variable identifies the pronunciation features of a given digit; the same digit can have at least two different digit latent variables, that is, at least two different pronunciations. For example, the digit "1" can be pronounced in more than one way.
In this embodiment, the deep neural network model includes an input layer, hidden layers and an output layer. The input layer includes one input node, which receives the audio information input from outside. The hidden layers include two or more hidden-layer nodes, which process the audio information and extract its digit latent variables. The output layer outputs the processing result, i.e. the digit latent variables of the audio information.
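The layer structure just described can be sketched as a tiny forward pass; the layer sizes, the ReLU nonlinearity, the frame pooling and the random stand-in weights are all our assumptions, since the patent does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in: int, n_out: int):
    # Small random weights stand in for trained parameters.
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

class DigitLatentDNN:
    """Input layer -> two hidden layers -> output layer, as in the
    embodiment; the output vector plays the role of the digit latent
    variable extracted from one digit's audio frames."""
    def __init__(self, n_in=40, n_hidden=64, n_latent=16):
        self.w1, self.b1 = layer(n_in, n_hidden)
        self.w2, self.b2 = layer(n_hidden, n_hidden)
        self.w3, self.b3 = layer(n_hidden, n_latent)

    def extract(self, frames: np.ndarray) -> np.ndarray:
        h = relu(frames @ self.w1 + self.b1)
        h = relu(h @ self.w2 + self.b2)
        out = h @ self.w3 + self.b3      # one output vector per frame
        return out.mean(axis=0)          # pool frames into one latent
```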
The deep neural network model is trained on a sound-sample set in which each digit said by a speaker has a corresponding sound sample. There may be multiple speakers, for example 500 or 1500; to some extent, the more training samples there are, the more accurate the results of identification with the trained neural network model. The sound-sample set contains a preset number of sound samples; each sound sample carries a labelled digit latent variable, and the digit latent variables correspond one-to-one with the sample labels. A sound sample may contain only one digit, and each digit in the set may correspond to at least two digit latent variables. The digits covered by the sound samples include every digit that a randomly generated reference numeric string might contain. For example, when the randomly generated reference numeric string contains any 6 of the 10 digits 0 to 9, the sound-sample set contains sound samples for all 10 digits 0 to 9.
During training, the input to the deep neural network model is the audio information, which may be a vector matrix obtained from it: the vectors corresponding to the digit-bearing audio data are extracted from the audio information in sequence and assembled into the matrix. The output of the model is the digit latent variable of each digit in the audio information.
It should be understood that each speaker corresponds to one neural-network recognition model; when multiple speakers need to be identified, the corresponding neural-network recognition models of the multiple speakers are trained.
S203: when the loudspeaker latent variable meets the preset requirement, input the digit latent variable into the preset Bayesian model for processing and obtain the likelihood-ratio score of the audio information. The preset requirement is set according to the loudspeaker latent-variable values observed for clearly recognizable audio information. The Bayesian model is trained, using a machine-learning algorithm, on the digit latent variables of each digit said by a single speaker in the sound-sample set; each digit latent variable carries an identity label identifying the speaker it belongs to, and the Bayesian model corresponds to that single speaker in the sound-sample set.
The terminal inputs the digit latent variable of each digit contained in the audio information into the preset Bayesian model and computes, by the formula p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), the probability corresponding to each digit said by the speaker in the audio information, where p(u_i) = N(u_i | 0, Σ_u) and p(v_j) = N(v_j | 0, Σ_v).

Here p(x_ijk | u_i, v_j, θ) denotes the probability that a person said a digit, and x_ijk denotes that the i-th person said the j-th digit in the k-th session. Because identity verification or information entry may require the speaker to say several different numeric strings, k indexes the session; θ is the collective name for the parameters of the Bayesian model. The conditional probability N(x_ijk | μ + u_i + v_j, Σ_ε) denotes a Gaussian distribution with mean μ + u_i + v_j and variance Σ_ε, where Σ_ε is the diagonal covariance of ε. The signal component is x_ijk = μ + u_i + v_j + ε_ijk, i.e. the signal component x_ijk depends on the speaker and the digit; the noise component ε_ijk denotes the deviation or noise of the i-th person saying the j-th digit in the k-th session, and μ is the overall mean of the training vectors.
u_i denotes the latent variable of speaker i and is defined as a Gaussian with diagonal covariance Σ_u; p(u_i) denotes the probability that the speaker is i, and N(u_i | 0, Σ_u) is the Gaussian distribution with covariance Σ_u. v_j denotes the latent variable of digit j and is defined as a Gaussian with diagonal covariance Σ_v; p(v_j) is the corresponding probability, and N(v_j | 0, Σ_v) is the Gaussian distribution with covariance Σ_v.
Formally, the Bayesian model can be described with the conditional probability N(x | μ, Σ), denoting a Gaussian in x with mean μ and covariance Σ.
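A sketch of evaluating the per-digit Gaussian likelihood p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε) above, working in log space and assuming the stated diagonal covariances; a likelihood-ratio helper matching the 0.6/0.4 example in the text is included. Function names are ours:

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """log N(x | mean, diag(var)) for a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_p_x_given_uv(x_ijk, mu, u_i, v_j, var_eps):
    # p(x_ijk | u_i, v_j, theta) = N(x_ijk | mu + u_i + v_j, Sigma_eps)
    return log_gauss_diag(x_ijk, mu + u_i + v_j, var_eps)

def likelihood_ratio(p_target: float) -> float:
    """Score = P(speaker i said digit j) / P(it was not speaker i)."""
    return p_target / (1.0 - p_target)
```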
Having computed the probability corresponding to each digit said by the speaker in the audio information, the terminal computes the likelihood-ratio score from those probabilities.
The likelihood-ratio score is the ratio of the probability that speaker i said the j-th digit to the probability that it was not speaker i who said it. For example, if the probability that speaker i said the j-th digit is 0.6, the probability that it was not speaker i is 0.4, and the likelihood-ratio score is 0.6/0.4 = 1.5.
It should be understood that the terminal may, for example, compute the probability corresponding to each digit in the audio information over a preset number of iterations; the preset number may be 10, or may be set according to actual needs. In that case the terminal can take the likelihood-ratio scores from the speaker's repeated utterances of the j-th digit, compute their average, and use the average as the speaker's likelihood-ratio score for the j-th digit.
To improve the accuracy of the identification result, the terminal may also take the likelihood-ratio scores from the speaker's repeated utterances of the j-th digit, filter out those greater than or equal to a preset likelihood-ratio score threshold, and use the mean of the filtered scores as the speaker's likelihood-ratio score for the j-th digit. The preset threshold may be 1.2, but is not limited to this and can be set according to the actual situation; no restriction is imposed here.
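The filter-then-average fusion just described can be sketched as follows; the fallback to a plain mean when no score reaches the threshold is our assumption, since the text does not say what happens in that case:

```python
def fused_score(scores, threshold: float = 1.2) -> float:
    """Keep the per-repetition likelihood-ratio scores that reach the
    preset threshold (1.2 in the embodiment) and average them; fall
    back to the plain mean if none survive the filter (assumption)."""
    kept = [s for s in scores if s >= threshold]
    if not kept:
        return sum(scores) / len(scores)
    return sum(kept) / len(kept)
```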
S204: output the identification result of the audio information based on the likelihood-ratio score.
Since the likelihood-ratio score is the ratio of the probability that speaker i said the j-th digit to the probability that it was not speaker i, the terminal can determine that the speaker in the audio information is speaker i when the likelihood-ratio score for the j-th digit is greater than 1, and that the speaker is not speaker i when the score is less than or equal to 1.
Further, in one embodiment, to allow a common likelihood-ratio score threshold to be used for identifying the speaker's identity and thus improve recognition efficiency, S204 may include: normalize the likelihood-ratio score using the formula s' = (s − μ1)/δ1, and output the identification result of the audio information based on the normalized score; where s' is the normalized likelihood-ratio score, s is the likelihood-ratio score, μ1 is the approximate mean of the false-speaker score distribution, and δ1 is its standard deviation.
Using the above formula, the terminal maps likelihood-ratio scores from different speakers into a similar range, so that a common likelihood-ratio score threshold can be used for the decision; μ1 and δ1 are, respectively, the approximate mean and standard deviation of the false-speaker score distribution. The likelihood-ratio score can be normalized by the following three normalization methods:
1) Zero normalization (Z-Norm) computes the average μ1 and standard deviation δ1 using a batch of non-target utterances against the target model; that is, it linearly transforms the score distribution using the estimated mean and variance of the impostor score distribution. For a given speaker probability model, Z-Norm tests a large batch of impostor speech utterances to obtain the impostor score distribution; the mean μ1 and standard deviation δ1 are then computed from that distribution and substituted into s' = (s − μ1)/δ1 to obtain the normalized likelihood-ratio score.
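A minimal sketch of the Z-Norm step: estimate μ1 and δ1 from a batch of impostor scores for this speaker model, then apply s' = (s − μ1)/δ1. Function and variable names are ours:

```python
from statistics import mean, pstdev

def z_norm(score: float, impostor_scores) -> float:
    """Z-Norm: s' = (s - mu1) / delta1, with mu1 and delta1 estimated
    offline from impostor utterance scores against the target model."""
    mu1 = mean(impostor_scores)
    delta1 = pstdev(impostor_scores)   # standard deviation of the batch
    return (score - mu1) / delta1
```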
2) Test normalization (T-Norm) computes the statistics against a group of false-speaker models using the feature vectors of the unknown speaker; normalization is again based on the mean and standard deviation of the impostor score distribution. The difference from Z-Norm is that T-Norm computes the mean and standard deviation using a large number of impostor speaker models rather than impostor voice data. The normalization is carried out at recognition time: the test speech data are compared simultaneously with the claimed speaker model and with a large number of impostor models, impostor scores are obtained, and the impostor score distribution and the normalization parameters μ1 and δ1 are then computed.
3) The mean of the normalized values computed by Z-Norm and T-Norm is used as the final normalized value, forming the score s' after normalization of the likelihood-ratio score s. Z-Norm and T-Norm are prior art and are not described further here.
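The third option, averaging the Z-Norm and T-Norm values of the same raw score, can be sketched as below; both impostor score batches are assumed to be given (impostor utterances against this model for Z-Norm, this utterance against impostor models for T-Norm), and the helper is our own:

```python
from statistics import mean, pstdev

def _norm(score: float, impostor_scores) -> float:
    # s' = (s - mu1) / delta1 over one impostor score distribution
    return (score - mean(impostor_scores)) / pstdev(impostor_scores)

def zt_score(score: float, z_impostors, t_impostors) -> float:
    """Final value: mean of the Z-Norm result and the T-Norm result
    of the same raw likelihood-ratio score."""
    return 0.5 * (_norm(score, z_impostors) + _norm(score, t_impostors))
```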
Further, in another embodiment, to improve the confidence and accuracy of the recognition result and reduce the probability of misjudgment, S204 may include: determine the identification result of the audio information based on the likelihood-ratio score; test whether the identification result is credible using a likelihood-ratio test, and output the identification result when it is verified as credible; when it is verified as not credible, return to S201, or end the process.
Referring also to Fig. 3, Fig. 3 is a schematic diagram of the null hypothesis and the alternative hypothesis provided by an embodiment of the present invention.
Specifically, verification is treated as a hypothesis-testing problem with a null hypothesis H0, under which the audio vectors share the same speaker and digit latent variables u_i and v_j, and an alternative hypothesis H1; i and j are positive integers greater than or equal to 1. Under the null hypothesis H0, one person corresponds to the digits said; under the alternative hypothesis H1, one person corresponds to multiple digits, or one digit corresponds to multiple people, or multiple people and multiple digits are mixed. U1, U2, V1 and V2 denote different speakers.
As shown in Fig. 3, the terminal can verify the result by comparing the likelihood of the data under the different hypotheses. Xt represents that a person said a digit, for example that person i said digit j; Xs represents that this digit was said by a person, for example that digit j was said by person i. ε_t denotes the error of Xt, and ε_s the error of Xs. Under the alternative hypothesis H1, the features Xt and Xs do not match; under the null hypothesis H0, the two judgments agree and the features Xt and Xs match.
When the features Xt and Xs match, the identification result of the audio information is determined to be accurate and credible, and the identification result is output. When Xt and Xs do not match, the identification result of the audio information is determined to be not credible and a misjudgment may exist; the flow then returns to S201, or the speaker-identification process ends.
Further, S205 may follow S204: when the identification result is credible and the identity check passes, respond to the voice-control instruction from the speaker corresponding to the audio information, and execute the preset operation corresponding to that instruction.

On obtaining a credible identification result, the terminal judges, against preset legitimate-identity information, whether the speaker corresponding to the result is a legitimate user; when the speaker is a legitimate user, the check passes. Afterwards, when a voice-control instruction input by the speaker is obtained, the terminal responds to it, obtains the preset operation corresponding to the instruction, and executes it.
For example, when searching for an article, the voice-control instruction is a search instruction; in response to it, the relevant information on the article corresponding to the search instruction is retrieved from a local database or from a network database.
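A minimal sketch of this gated dispatch, responding to a voice-control instruction only after the identity result has been verified; the command names and return strings are purely illustrative:

```python
def handle_voice_command(command: str, identity_ok: bool) -> str:
    """Dispatch a voice-control instruction to its preset operation,
    but only when the identity check has passed."""
    operations = {
        "search": lambda: "query local database, then network database",
        "open":   lambda: "open the requested application",
    }
    if not identity_ok:
        return "rejected: identity not verified"
    op = operations.get(command)
    return op() if op else "unknown command"
```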
In the embodiment of the present invention, the loudspeaker latent variable and the digit latent variable of the audio information to be identified are extracted; when the loudspeaker latent variable meets the preset requirement, the digit latent variable is input into the preset Bayesian model for processing to obtain the likelihood-ratio score of the audio information, and the identification result is output based on that score. Since the preset requirement is set according to the loudspeaker latent-variable values observed for clearly recognizable audio information, interference with the identification result from the loudspeaker's own rendering of the digit pronunciations can be excluded once the loudspeaker latent variable meets the requirement. The speaker's identity is then identified from the digit latent variables of the digits said by the person under test; since each digit can have multiple digit latent variables, the speaker can be identified accurately even when the same digit is pronounced differently at different moments. This avoids interference with the identification result caused by different loudspeakers rendering the same digit differently, or by the speaker pronouncing the same digit differently at different moments, and thereby improves the accuracy of the identification result. Outputting the identification result based on the likelihood-ratio score of the audio information further reduces the probability of misjudgment and improves the accuracy of the recognition result.
It should be understood that the ordinal numbering of the steps in the above embodiments does not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
Referring to Fig. 4, Fig. 4 is a schematic diagram of a terminal provided by an embodiment of the present invention. Each unit included in the terminal is used to execute the steps in the embodiments corresponding to Figs. 1 and 2; refer to the related descriptions in those embodiments. For ease of description, only the parts related to this embodiment are shown. Referring to Fig. 4, the terminal 4 includes:
an acquiring unit 410, configured to acquire the audio information to be identified that the person under test says for the reference numeric string; wherein the reference numeric string is stored in advance and played or displayed at random, and the audio information includes the audio corresponding to the numeric string said by the person under test;
an extraction unit 420, configured to extract the loudspeaker latent variable and the digit latent variable of the audio information; wherein the loudspeaker latent variable identifies the characteristic information of the loudspeaker, and the digit latent variable is extracted when the numeric string said by the person under test is confirmed to be identical to the reference numeric string and identifies the pronunciation features of the person under test for the digits in the audio information;
a recognition unit 430, configured to input the digit latent variable into the preset Bayesian model for voiceprint recognition when the loudspeaker latent variable meets the preset requirement, and obtain the identification result; wherein the preset requirement is set according to the loudspeaker latent-variable values observed for clearly recognizable audio information, the Bayesian model is trained, using a machine-learning algorithm, on the digit latent variables of each digit said by a single speaker in the sound-sample set, each digit latent variable carries an identity label identifying the speaker it belongs to, and the Bayesian model corresponds to that single speaker in the sound-sample set.
Further, the extraction unit 420 includes:

a first detection unit, configured to extract the loudspeaker latent variable from the audio information and detect whether the audio information is qualified based on the value of the loudspeaker latent variable;

a second detection unit, configured to detect, when the audio information is qualified, whether the numeric string said by the person under test is identical to the reference numeric string, based on the reference numeric string and the audio corresponding to the numeric string said by the person under test;

a latent-variable extraction unit, configured to extract the digit latent variable from the audio information when the detection result is that the strings are identical.
Further, the recognition unit 430 includes:

a computing unit, configured to input the digit latent variable into the preset Bayesian model for processing and obtain the likelihood-ratio score of the audio information;

an identity recognition unit, configured to output the identification result of the audio information based on the likelihood-ratio score.
Further, the identity recognition unit is specifically configured to: normalize the likelihood-ratio score using the formula s' = (s − μ1)/δ1, and output the identification result of the audio information based on the normalized likelihood-ratio score; wherein s' is the normalized likelihood-ratio score, s is the likelihood-ratio score, μ1 is the approximate mean of the false-speaker score distribution, and δ1 is the standard deviation of the false-speaker score distribution.
Further, the identity recognition unit is specifically configured to: determine the identification result of the audio information based on the likelihood-ratio score; test whether the identification result is credible using a likelihood-ratio test, and output the identification result when it is verified as credible; wherein, when the identification result is verified as not credible, the process ends, or the acquiring unit 410 executes the acquisition of the audio information to be identified that the person under test says for the reference numeric string.
Fig. 5 is a schematic diagram of a terminal provided by another embodiment of the present invention. As shown in Fig. 5, the terminal 5 of this embodiment includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps in the above embodiments of the speaker-identification method of each terminal, such as S101 to S103 shown in Fig. 1; alternatively, when executing the computer program 52, the processor 50 implements the functions of the units in the above device embodiments, such as the functions of units 310 to 340 shown in Fig. 4.
Illustratively, the computer program 52 can be divided into one or more units, which are stored in the memory 51 and executed by the processor 50 to complete the present invention. The one or more units can be a series of computer-program instruction segments capable of completing specific functions, the instruction segments describing the execution process of the computer program 52 in the terminal 5. For example, the computer program 52 can be divided into an acquiring unit, an extraction unit and a recognition unit, the specific functions of each unit being as described above.
The terminal may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art will understand that Fig. 5 is only an example of the terminal 5 and does not constitute a limitation on the terminal 5; the terminal may include more or fewer components than shown, combine certain components, or use different components; for example, the terminal may also include input/output terminals, network access terminals, a bus, etc.
The processor 50 can be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor can be a microprocessor, or the processor can be any conventional processor.
The memory 51 can be an internal storage unit of the terminal 5, such as a hard disk or memory of the terminal 5. The memory 51 can also be an external storage terminal of the terminal 5, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. equipped on the terminal 5. Further, the memory 51 can include both an internal storage unit and an external storage terminal of the terminal 5. The memory 51 stores the computer program and the other programs and data needed by the terminal, and can also be used to temporarily store data that has been or will be output.
The embodiments described above are merely illustrative of the technical solutions of the present invention and are not limiting. Although the invention has been explained in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions documented in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements, which do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention, shall all be included within the protection scope of the present invention.
Claims (10)
1. A method for identifying a speaker, characterized by comprising:
acquiring the audio information to be identified that a person under test says for a reference numeric string; wherein the reference numeric string is stored in advance and played or displayed at random, and the audio information includes the audio corresponding to the numeric string said by the person under test;
extracting the loudspeaker latent variable and the digit latent variable of the audio information; wherein the loudspeaker latent variable identifies the characteristic information of the loudspeaker, the digit latent variable is extracted when the numeric string said by the person under test is confirmed to be identical to the reference numeric string, and the digit latent variable identifies the pronunciation features of the person under test for the digits in the audio information;
when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain the identification result; wherein the preset requirement is set according to the loudspeaker latent-variable values observed for clearly recognizable audio information; the Bayesian model is trained, using a machine-learning algorithm, on the digit latent variables of each digit said by a single speaker in a sound-sample set, each digit latent variable carries an identity label identifying the speaker it belongs to, and the Bayesian model corresponds to the single speaker in the sound-sample set.
2. The method according to claim 1, wherein extracting the loudspeaker latent variable and the digit latent variable of the audio information comprises:
extracting the loudspeaker latent variable from the audio information, and detecting whether the audio information is qualified based on the value of the loudspeaker latent variable;
when the audio information is qualified, detecting whether the numeric string said by the person under test is identical to the reference numeric string, based on the reference numeric string and the audio corresponding to the numeric string said by the person under test;
when the detection result is that the strings are identical, extracting the digit latent variable from the audio information.
3. The method according to claim 1 or 2, wherein when the speaker latent variable meets the preset requirement, inputting the digit latent variable into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result comprises:
inputting the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
outputting the identity recognition result of the audio information based on the likelihood ratio score.
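A minimal sketch of the scoring step of claim 3, under assumed model details the claim leaves open: each per-speaker Bayesian model is reduced here to one Gaussian per digit, and the likelihood ratio is taken against a background model.

```python
import math

# Sketch of claim 3's scoring step under an assumed model form (one
# Gaussian per digit, ratio against a background model); the patent
# does not fix these details, they are illustrative choices.

def gaussian_logpdf(x, mean, std):
    # Log density of a univariate normal distribution.
    return (-0.5 * math.log(2 * math.pi * std * std)
            - (x - mean) ** 2 / (2 * std * std))

def likelihood_ratio_score(digit_latents, speaker_model, background_model):
    # digit_latents: {digit: latent value}
    # speaker_model / background_model: {digit: (mean, std)}
    score = 0.0
    for digit, x in digit_latents.items():
        score += gaussian_logpdf(x, *speaker_model[digit])
        score -= gaussian_logpdf(x, *background_model[digit])
    return score  # > 0 favours the claimed speaker over the background
```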
4. The method according to claim 3, wherein outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
normalizing the likelihood ratio score using the formula s' = (s - μ1)/δ1, and outputting the identity recognition result of the audio information based on the normalized likelihood ratio score; wherein s' is the normalized likelihood ratio score, s is the likelihood ratio score, μ1 is the approximate mean of the impostor speaker score distribution, and δ1 is the standard deviation of the impostor speaker score distribution.
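Claim 4's formula s' = (s - μ1)/δ1 is a zero-normalization of the raw score against the impostor score distribution. Below is a direct transcription, plus a hypothetical helper `impostor_stats` for estimating μ1 and δ1 from held-out impostor trial scores (the claim does not say how those statistics are obtained).

```python
import statistics

# Zero-normalization of claim 4: s' = (s - mu_1) / delta_1, where mu_1
# and delta_1 describe the impostor score distribution. Estimating them
# from held-out impostor scores is an assumption, not stated in the claim.

def impostor_stats(impostor_scores):
    # Approximate mean and standard deviation of the impostor distribution.
    return statistics.mean(impostor_scores), statistics.pstdev(impostor_scores)

def normalize_score(s, impostor_mean, impostor_std):
    # s' = (s - mu_1) / delta_1
    return (s - impostor_mean) / impostor_std
```

Normalizing against the impostor distribution puts scores from different speaker models on a comparable scale, so a single decision threshold can be applied.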
5. The method according to claim 3, wherein outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
determining the identity recognition result of the audio information based on the likelihood ratio score;
checking whether the identity recognition result is credible using a likelihood ratio test, and outputting the identity recognition result when it is verified to be credible; wherein, when the identity recognition result is verified to be not credible, returning to the step of obtaining the to-be-recognized audio information spoken by the test subject for the reference digit string, or terminating.
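The verify-or-retry control flow of claim 5 might look like the following sketch. The accept threshold, credibility margin, and attempt limit are assumptions; the claim states only that a non-credible result triggers re-acquisition of audio or termination.

```python
# Hypothetical sketch of claim 5's verify-or-retry flow.

ACCEPT_THRESHOLD = 0.0  # decision boundary on the normalized score
CREDIBLE_MARGIN = 1.0   # assumed margin a score must clear to be credible

def decide(score):
    return "accept" if score >= ACCEPT_THRESHOLD else "reject"

def is_credible(score):
    # A simple likelihood-ratio-style check: scores too close to the
    # decision boundary are treated as not credible.
    return abs(score - ACCEPT_THRESHOLD) >= CREDIBLE_MARGIN

def recognize(score_stream, max_attempts=3):
    # score_stream yields one normalized score per captured utterance;
    # each iteration corresponds to re-acquiring audio from the subject.
    for _, score in zip(range(max_attempts), score_stream):
        if is_credible(score):
            return decide(score)  # credible: output the recognition result
    return None                   # never credible: terminate
```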
6. A terminal, comprising:
an obtaining unit, configured to obtain to-be-recognized audio information spoken by a test subject for a reference digit string; wherein the reference digit string is pre-stored and shuffled or randomly displayed, and the audio information comprises the audio corresponding to the digit string spoken by the test subject;
an extraction unit, configured to extract a speaker latent variable and a digit latent variable of the audio information; wherein the speaker latent variable is used to characterize feature information of the speaker, the digit latent variable is extracted when the digit string spoken by the test subject is confirmed to be identical to the reference digit string, and the digit latent variable is used to characterize the test subject's pronunciation of the digits in the audio information;
a recognition unit, configured to, when the speaker latent variable meets a preset requirement, input the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a voice sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the voice sample set.
7. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
obtaining to-be-recognized audio information spoken by a test subject for a reference digit string; wherein the reference digit string is pre-stored and shuffled or randomly displayed, and the audio information comprises the audio corresponding to the digit string spoken by the test subject;
extracting a speaker latent variable and a digit latent variable of the audio information; wherein the speaker latent variable is used to characterize feature information of the speaker, the digit latent variable is extracted when the digit string spoken by the test subject is confirmed to be identical to the reference digit string, and the digit latent variable is used to characterize the test subject's pronunciation of the digits in the audio information;
when the speaker latent variable meets a preset requirement, inputting the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is configured based on the values of the speaker latent variable that correspond to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a voice sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the voice sample set.
8. The terminal according to claim 7, wherein extracting the speaker latent variable and the digit latent variable of the audio information comprises:
extracting the speaker latent variable from the audio information, and detecting, based on the value of the speaker latent variable, whether the audio information is qualified;
when the audio information is qualified, detecting, based on the reference digit string and the audio corresponding to the digit string spoken by the test subject, whether the digit string spoken by the test subject is identical to the reference digit string;
when the detection result is that they are identical, extracting the digit latent variable from the audio information.
9. The terminal according to claim 8, wherein when the speaker latent variable meets the preset requirement, inputting the digit latent variable into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result comprises:
inputting the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
outputting the identity recognition result of the audio information based on the likelihood ratio score.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354414.7A CN110111798B (en) | 2019-04-29 | 2019-04-29 | Method, terminal and computer readable storage medium for identifying speaker |
PCT/CN2019/103299 WO2020220541A1 (en) | 2019-04-29 | 2019-08-29 | Speaker recognition method and terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354414.7A CN110111798B (en) | 2019-04-29 | 2019-04-29 | Method, terminal and computer readable storage medium for identifying speaker |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110111798A (en) | 2019-08-09 |
CN110111798B CN110111798B (en) | 2023-05-05 |
Family
ID=67487460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910354414.7A Active CN110111798B (en) | 2019-04-29 | 2019-04-29 | Method, terminal and computer readable storage medium for identifying speaker |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110111798B (en) |
WO (1) | WO2020220541A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110503956A (en) * | 2019-09-17 | 2019-11-26 | 平安科技(深圳)有限公司 | Audio recognition method, device, medium and electronic equipment |
CN111768789A (en) * | 2020-08-03 | 2020-10-13 | 上海依图信息技术有限公司 | Electronic equipment and method, device and medium for determining identity of voice sender thereof |
WO2020220541A1 (en) * | 2019-04-29 | 2020-11-05 | 平安科技(深圳)有限公司 | Speaker recognition method and terminal |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112820297A (en) * | 2020-12-30 | 2021-05-18 | 平安普惠企业管理有限公司 | Voiceprint recognition method and device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150235637A1 (en) * | 2014-02-14 | 2015-08-20 | Google Inc. | Recognizing speech in the presence of additional audio |
CN106448685A (en) * | 2016-10-09 | 2017-02-22 | 北京远鉴科技有限公司 | System and method for identifying voice prints based on phoneme information |
CN106531171A (en) * | 2016-10-13 | 2017-03-22 | 普强信息技术(北京)有限公司 | Method for realizing dynamic voiceprint password system |
CN107104803A (en) * | 2017-03-31 | 2017-08-29 | 清华大学 | It is a kind of to combine the user ID authentication method confirmed with vocal print based on numerical password |
CN109256138A (en) * | 2018-08-13 | 2019-01-22 | 平安科技(深圳)有限公司 | Auth method, terminal device and computer readable storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5982297B2 (en) * | 2013-02-18 | 2016-08-31 | 日本電信電話株式会社 | Speech recognition device, acoustic model learning device, method and program thereof |
US20190087734A1 (en) * | 2016-03-28 | 2019-03-21 | Sony Corporation | Information processing apparatus and information processing method |
JP2018013722A (en) * | 2016-07-22 | 2018-01-25 | 国立研究開発法人情報通信研究機構 | Acoustic model optimization device and computer program therefor |
KR101843074B1 (en) * | 2016-10-07 | 2018-03-28 | 서울대학교산학협력단 | Speaker recognition feature extraction method and system using variational auto encoder |
US9911413B1 (en) * | 2016-12-28 | 2018-03-06 | Amazon Technologies, Inc. | Neural latent variable model for spoken language understanding |
CN109166586B (en) * | 2018-08-02 | 2023-07-07 | 平安科技(深圳)有限公司 | Speaker identification method and terminal |
CN110111798B (en) * | 2019-04-29 | 2023-05-05 | 平安科技(深圳)有限公司 | Method, terminal and computer readable storage medium for identifying speaker |
2019
- 2019-04-29: CN application CN201910354414.7A granted as patent CN110111798B (en), status active
- 2019-08-29: WO application PCT/CN2019/103299 published as WO2020220541A1 (en), application filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020220541A1 (en) * | 2019-04-29 | 2020-11-05 | 平安科技(深圳)有限公司 | Speaker recognition method and terminal |
CN110503956A (en) * | 2019-09-17 | 2019-11-26 | 平安科技(深圳)有限公司 | Audio recognition method, device, medium and electronic equipment |
CN110503956B (en) * | 2019-09-17 | 2023-05-12 | 平安科技(深圳)有限公司 | Voice recognition method, device, medium and electronic equipment |
CN111768789A (en) * | 2020-08-03 | 2020-10-13 | 上海依图信息技术有限公司 | Electronic equipment and method, device and medium for determining identity of voice sender thereof |
CN111768789B (en) * | 2020-08-03 | 2024-02-23 | 上海依图信息技术有限公司 | Electronic equipment, and method, device and medium for determining identity of voice generator of electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
WO2020220541A1 (en) | 2020-11-05 |
CN110111798B (en) | 2023-05-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features | |
CN107104803B (en) | User identity authentication method based on digital password and voiceprint joint confirmation | |
CN111712874B (en) | Method, system, device and storage medium for determining sound characteristics | |
CN109166586B (en) | Speaker identification method and terminal | |
CN110111798A (en) | Method and terminal for recognizing a speaker | |
Dey et al. | Speech biometric based attendance system | |
CN107924682A (en) | Neural networks for speaker verification | |
US6205424B1 (en) | Two-staged cohort selection for speaker verification system | |
CN107610707A (en) | Voiceprint recognition method and device | |
US10630680B2 (en) | System and method for optimizing matched voice biometric passphrases | |
EP4233047A1 (en) | Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium | |
TW202213326A (en) | Generalized negative log-likelihood loss for speaker verification | |
CN115457938A (en) | Method, device, storage medium and electronic device for recognizing wake-up words | |
CN110364168A (en) | Environment-aware voiceprint recognition method and system | |
US6499012B1 (en) | Method and apparatus for hierarchical training of speech models for use in speaker verification | |
Bui et al. | A non-linear GMM KL and GUMI kernel for SVM using GMM-UBM supervector in home acoustic event classification | |
JPWO2020003413A1 (en) | Information processing equipment, control methods, and programs | |
Mandalapu et al. | Multilingual voice impersonation dataset and evaluation | |
CN110931020B (en) | Voice detection method and device | |
Hong et al. | Generalization ability improvement of speaker representation and anti-interference for speaker verification | |
CN108694950B (en) | Speaker confirmation method based on deep hybrid model | |
CN112133291A (en) | Language identification model training, language identification method and related device | |
Lotia et al. | A review of various score normalization techniques for speaker identification system | |
Mohamed et al. | An Overview of the Development of Speaker Recognition Techniques for Various Applications. | |
Yang et al. | User verification based on customized sentence reading |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||