CN110111798A - A kind of method and terminal identifying speaker - Google Patents
A kind of method and terminal identifying speaker
- Publication number
- CN110111798A (application CN201910354414.7A)
- Authority
- CN
- China
- Prior art keywords
- latent variable
- audio-frequency information
- loudspeaker
- numeric string
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/12—Score normalisation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The present invention, which is applicable to the field of computer technology, provides a method and terminal for identifying a speaker. The method comprises: obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string, the audio information including the numeric string; extracting the loudspeaker latent variable and the digit latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information; and, when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result. By identifying the speaker's identity from the loudspeaker latent variable and the digit latent variables in the audio information, the embodiments of the present invention avoid interference with the recognition result caused by different loudspeakers rendering the same digit differently, or by the same speaker pronouncing the same digit differently at different times, thereby improving the accuracy of the identity recognition result.
Description
Technical field
The invention belongs to the field of computer technology, and more particularly relates to a method and terminal for identifying a speaker.
Background technique
With the rapid development of information and network technology, people's demand for identity recognition technology keeps growing. Identity recognition technology based on conventional password authentication has exposed many shortcomings in practical applications (for example, relatively low security and reliability), while identity recognition based on biometric features has matured in recent years and shown its superiority in practice. Voiceprint recognition is one such biometric identity recognition technology.
A voiceprint refers to the information pattern of a speaker's voice spectrum. Since every person's vocal organs differ, the sounds they produce and their tones differ as well; identification using the voiceprint as the essential feature is therefore stable and hard to replace.
Voiceprint recognition comes in two kinds: text-dependent (Text-Dependent) and text-independent (Text-Independent). In contrast to speaker identification on free, text-independent speech content, text-dependent speaker verification systems are better suited to security applications, because they tend to show higher accuracy on short sessions.
Typical text-dependent speaker identification has each user utter a fixed phrase to match the enrollment and test utterances. In that case, the system can be fooled by playing back pre-recorded speech from the user. When training and test utterances come from different scenarios, sharing the same speech content can improve the security of the identification to a certain extent: the system randomly provides a numeric string, and the user must read out the corresponding content correctly before the voiceprint is recognized. This randomness gives each collected voiceprint a content-level difference over time in classic text-dependent recognition. However, when recognizing a speaker with randomly prompted connected digits, the digit vocabulary is fixed and digit samples are limited, and different loudspeakers render the same digit differently, so the speaker's identity cannot be identified accurately.
Summary of the invention
In view of this, embodiments of the present invention provide a method and terminal for identifying a speaker, to solve the problem in the prior art that, for complicated voiceprint information (for example, short speech or imitated speech), a text-independent voiceprint recognition system cannot accurately extract the speaker's speech features and therefore cannot accurately identify the speaker's identity.
A first aspect of the embodiments of the present invention provides a method for identifying a speaker, comprising:

obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says;

extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information;

when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a sound sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
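The flow above (obtain audio, check the loudspeaker latent variable, verify the spoken string, then score the digit latent variables) can be sketched as one orchestration function. Every helper below is an injected stub assumed for illustration, not an API named by the patent:

```python
def recognize_speaker(audio, benchmark_digits, *, get_snr, snr_threshold,
                      transcribe_digits, extract_digit_latents,
                      score_with_bayes_model, score_threshold):
    """Sketch of the claimed flow; all helpers are injected stubs."""
    if get_snr(audio) < snr_threshold:
        return False  # loudspeaker latent variable fails the preset requirement
    if transcribe_digits(audio) != benchmark_digits:
        return False  # spoken string must match the benchmark numeric string
    latents = extract_digit_latents(audio)  # one latent variable per digit
    return score_with_bayes_model(latents) >= score_threshold

# Toy stubs standing in for the real components:
ok = recognize_speaker(
    b"fake-audio-bytes", "583920",
    get_snr=lambda a: 75.0, snr_threshold=70.0,
    transcribe_digits=lambda a: "583920",
    extract_digit_latents=lambda a: [[0.1] * 8 for _ in "583920"],
    score_with_bayes_model=lambda latents: 8,
    score_threshold=7,
)
```

With these stubs the SNR check, string match, and score threshold all pass, so `ok` is true.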
A second aspect of the embodiments of the present invention provides a terminal, comprising:

an acquiring unit, configured to obtain to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says;

an extraction unit, configured to extract the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information;

a recognition unit, configured to, when the loudspeaker latent variable meets a preset requirement, input the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a sound sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
A third aspect of the embodiments of the present invention provides a terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says;

extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information;

when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a sound sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:

obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says;

extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information;

when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a sound sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
Implementing the method and terminal for identifying a speaker provided by the embodiments of the present invention has the following beneficial effects:

The embodiments of the present invention extract the loudspeaker latent variable and digit latent variables of the to-be-identified audio information; when the loudspeaker latent variable meets the requirement, the digit latent variables are input into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result. Since the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable of the audio information meets the preset requirement, interference of the loudspeaker's own rendering of digit pronunciation with the recognition result can be excluded. The speaker's identity is then identified from the digit latent variable of each digit the person under test says. Because each digit can have multiple digit latent variables, the speaker's identity can still be identified accurately even if the speaker pronounces the same digit differently at different times. This avoids cases where different loudspeakers render the same digit differently, or where the speaker pronounces the same digit differently at different times, interfering with the identity recognition result, and thus improves the accuracy of the identity recognition result.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the embodiments or the description of the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

Fig. 1 is an implementation flowchart of a method for identifying a speaker provided by an embodiment of the present invention;

Fig. 2 is an implementation flowchart of a method for identifying a speaker provided by another embodiment of the present invention;

Fig. 3 is a schematic diagram of the null hypothesis and the alternative hypothesis provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of a terminal provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of a terminal provided by another embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 1, Fig. 1 is an implementation flowchart of a method for identifying a speaker provided by an embodiment of the present invention. The execution subject of the method in this embodiment is a terminal. The terminal includes, but is not limited to, mobile terminals such as smartphones, tablet computers, and wearable devices, and may also be a desktop computer or the like. As shown in the figure, the method for identifying a speaker may include:
S101: obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says.
When detecting a speaker recognition instruction, the terminal can obtain, through a built-in sound pickup device (for example, a microphone), the to-be-identified audio information uttered by the speaker in the surrounding environment; in this case, the audio information is uttered by the speaker according to the benchmark numeric string that the terminal provides at random. Alternatively, the terminal obtains, according to a file identifier included in the speaker recognition instruction, the audio file or video file corresponding to that identifier, extracts the audio information from the file, and uses it as the to-be-identified audio information. The audio or video file contains audio information obtained when the person under test read out the benchmark numeric string; it may be uploaded by the user, or downloaded from a server for storing audio or video files. The benchmark numeric string is stored in the terminal in advance and is broadcast or displayed at random by the terminal. There may be multiple benchmark numeric strings.
The to-be-identified audio information includes the audio corresponding to the numeric string, and the numeric string consists of at least one digit.
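As a minimal sketch of how the terminal might generate the random benchmark numeric string it prompts (the function name and the use of the `secrets` module are illustrative assumptions, not from the patent):

```python
import secrets

def generate_benchmark_digit_string(length: int = 6) -> str:
    """Generate a random benchmark numeric string to prompt the person under test."""
    return "".join(secrets.choice("0123456789") for _ in range(length))

prompt = generate_benchmark_digit_string(6)  # e.g. a 6-digit string such as "583920"
```

Using `secrets` rather than `random` makes the prompt unpredictable, which matters because the randomness of the string is what defeats replayed recordings.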
S102: extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information.
Based on the obtained audio information, the terminal computes the loudspeaker latent variable of the audio information. The loudspeaker latent variable includes, but is not limited to, the signal-to-noise ratio, and may also include the efficiency, the sound pressure level, and so on, of the loudspeaker.
The signal-to-noise ratio (SNR) is a parameter describing the proportion of effective signal to noise in a signal; the higher the loudspeaker's signal-to-noise ratio, the clearer the picked-up sound. For example, the terminal extracts the normal speech signal, and the noise signal present when there is no speech, from the audio information, and computes the signal-to-noise ratio of the audio information based on the two.
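That computation can be sketched as follows, assuming a speech segment and a silence-only segment have already been separated (the segmentation step itself is not shown, and the signals here are synthetic):

```python
import numpy as np

def estimate_snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Estimate SNR in dB from a speech segment and a silence (noise-only) segment."""
    signal_power = np.mean(speech.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(16000)             # 1 s of low-level noise at 16 kHz
tone = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
speech = tone + 0.01 * rng.standard_normal(16000)     # "speech" = tone + same noise floor
snr = estimate_snr_db(speech, noise)                  # roughly 37 dB for these signals
```

A clean tone over a faint noise floor yields an SNR of a few tens of dB, which is the regime where a threshold like the "greater than or equal to 70" example later in the text would simply be tuned to the chosen units and pickup hardware.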
The terminal can input the obtained audio information into a pre-trained deep neural network (Deep Neural Networks, DNN) model and extract the digit latent variable of each digit in the audio information through the deep neural network. A digit latent variable identifies a pronunciation characteristic of a given digit; the same digit can have at least two different digit latent variables, that is, at least two different pronunciations. For example, the digit "1" can be pronounced "yi", "yao", and so on.

In this embodiment, the deep neural network model includes an input layer, hidden layers, and an output layer. The input layer includes one input node and receives the audio information input from outside. The hidden layers include more than two hidden-layer nodes, process the audio information, and extract its digit latent variables. The output layer outputs the processing result, namely the digit latent variables of the audio information.
The deep neural network model is trained on a sound sample set, which includes sound samples of each digit spoken by speakers. There can be many speakers, for example 500 or 1500; to a certain extent, the more training samples there are, the more accurate the results when recognizing with the trained neural network model. The sound samples contain the preset digits; each sound sample has a labeled digit latent variable, and the digit latent variables correspond one-to-one with the sample labels. A sound sample may contain only one digit, and each digit in the sound sample set may correspond to at least two digit latent variables. The digits covered by the sound samples include every digit that any randomly generated benchmark numeric string may contain. For example, when the randomly generated benchmark numeric string contains any 6 of the 10 digits 0 to 9, the sound sample set includes sound samples of all 10 digits 0 to 9.
During training, the input data of the deep neural network model is audio information; this input can be a vector matrix obtained from the audio information, composed of the vectors corresponding to the digit-bearing audio data extracted in sequence from the audio information. The output data of the deep neural network model is the digit latent variable of each digit in the audio information.

It can be understood that each speaker corresponds to one neural network recognition model; when multiple speakers need to be recognized, the neural network recognition models corresponding to the multiple speakers are trained.
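As an illustrative sketch only (the layer sizes, ReLU activations, random weights, and mean-pooling over a digit's frames are all assumptions; the patent specifies only input layer, hidden layers, and output layer), a feed-forward extractor mapping per-frame features to one digit latent variable could look like:

```python
import numpy as np

class DigitLatentExtractor:
    """Toy feed-forward network: per-frame features in, one digit latent variable out."""

    def __init__(self, feat_dim=40, hidden_dim=256, latent_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained random weights; a real model would be trained on the sample set.
        self.w1 = 0.01 * rng.standard_normal((feat_dim, hidden_dim))
        self.w2 = 0.01 * rng.standard_normal((hidden_dim, hidden_dim))
        self.w3 = 0.01 * rng.standard_normal((hidden_dim, latent_dim))

    def extract(self, frames: np.ndarray) -> np.ndarray:
        h = np.maximum(frames @ self.w1, 0.0)   # hidden layer 1 (ReLU)
        h = np.maximum(h @ self.w2, 0.0)        # hidden layer 2 (ReLU)
        per_frame = h @ self.w3                 # per-frame latent output
        return per_frame.mean(axis=0)           # pool over the digit's frames

extractor = DigitLatentExtractor()
frames = np.random.default_rng(1).standard_normal((50, 40))  # 50 frames of one digit
z = extractor.extract(frames)                                # 64-dim digit latent variable
```

The pooling step reflects the idea that one utterance of one digit, however many frames long, yields a single digit latent variable to feed the Bayesian model.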
When extracting the loudspeaker latent variable, the terminal judges, based on a preset condition, whether the extracted loudspeaker latent variable meets the preset requirement. The preset condition is used to judge whether the sound in the audio information can be clearly recognized, and can be set based on the value of the loudspeaker latent variable corresponding to a clearly recognizable sound. For example, when the loudspeaker latent variable is the signal-to-noise ratio, the preset condition can be that the signal-to-noise ratio is greater than or equal to a preset SNR threshold, the threshold being the signal-to-noise ratio corresponding to clearly recognizable sound.

When the loudspeaker latent variable meets the requirement, S103 is executed; when it does not, the flow can return to S101, or an identity recognition result can be output indicating that the current speaker does not match the speaker corresponding to the benchmark numeric string.
S103: when the loudspeaker latent variable meets the preset requirement, inputting the digit latent variables into the preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in the sound sample set; each digit latent variable carries an identity label identifying the speaker to whom it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.

For example, when a speaker says a certain digit and produces a given digit latent variable, the identity label of that digit latent variable indicates the identity of the speaker who said the digit.
In this embodiment, the expression of the preset Bayesian model is as follows:

p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), where p(u_i) = N(u_i | 0, Σ_u) and p(v_j) = N(v_j | 0, Σ_v).

Here p(x_ijk | u_i, v_j, θ) denotes the probability that a person said a digit, and x_ijk denotes the i-th person saying the j-th digit in the k-th session. Since a speaker may be required to say several different numeric strings during identity verification or enrollment, k indexes the session, and θ collectively denotes the parameters of the Bayesian model. The conditional probability N(x_ijk | μ + u_i + v_j, Σ_ε) denotes a Gaussian distribution with mean μ + u_i + v_j and covariance Σ_ε, where Σ_ε is the diagonal covariance of ε. The signal component is x_ijk = μ + u_i + v_j + ε_ijk, that is, the signal component depends on the speaker and the digit; the noise component ε_ijk denotes the deviation or noise when the i-th person says the j-th digit in the k-th session, and μ denotes the overall mean of the training vectors.

u_i denotes the latent variable of speaker i and is modeled as a Gaussian with diagonal covariance Σ_u, so p(u_i) = N(u_i | 0, Σ_u); v_j denotes the latent variable of digit j and is modeled as a Gaussian with diagonal covariance Σ_v, so p(v_j) = N(v_j | 0, Σ_v).

In form, the Bayesian model can be described with the conditional probability N(x | μ, Σ), which denotes a Gaussian over x with mean μ and covariance Σ.
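A minimal sketch of evaluating that likelihood with diagonal covariances (the dimension and the latent values below are illustrative; learning μ, Σ_u, Σ_v, Σ_ε from data is not shown):

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """log N(x | mean, diag(var)) for a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def model_log_likelihood(x_ijk, mu, u_i, v_j, var_eps):
    """log p(x_ijk | u_i, v_j) = log N(x_ijk | mu + u_i + v_j, Sigma_eps)."""
    return log_gaussian_diag(x_ijk, mu + u_i + v_j, var_eps)

d = 4                          # latent dimension (illustrative)
mu = np.zeros(d)               # overall mean of the training vectors
u_i = np.full(d, 0.1)          # latent variable of speaker i
v_j = np.full(d, 0.2)          # latent variable of digit j
var_eps = np.full(d, 0.5)      # diagonal of Sigma_eps
x = mu + u_i + v_j             # an observation sitting exactly at the model mean
ll = model_log_likelihood(x, mu, u_i, v_j, var_eps)
```

At the mean, the quadratic term vanishes and the log-likelihood reduces to the normalization constant, which makes the identity x_ijk = μ + u_i + v_j + ε_ijk easy to check term by term.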
After the terminal finishes iteratively computing, for each digit included in the audio information, the probability that the speaker said it, the terminal performs identity recognition on the speaker based on the probabilities obtained in each iteration, obtaining the identity recognition result.

For example, with a total of 10 iterations: when the probability that speaker i said a digit is greater than or equal to a preset probability threshold (for example, 0.8), 1 point is scored; when it is less than the threshold, 0 points are scored. The total score of speaker i is counted after the 10 iterations, and when the total score is greater than or equal to a preset score threshold (for example, 7 points), the audio information is determined to come from speaker i.
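The scoring rule in that example can be sketched directly (the thresholds 0.8 and 7 are taken from the text; the probabilities below are made up for illustration):

```python
def identify_speaker(per_iteration_probs, prob_threshold=0.8, score_threshold=7):
    """Score 1 point per iteration whose probability meets prob_threshold;
    accept the claimed speaker when the total reaches score_threshold."""
    total = sum(1 for p in per_iteration_probs if p >= prob_threshold)
    return total >= score_threshold, total

probs = [0.9, 0.85, 0.95, 0.7, 0.92, 0.88, 0.81, 0.9, 0.6, 0.83]  # 10 iterations
accepted, score = identify_speaker(probs)  # 8 of 10 meet 0.8, so the speaker is accepted
```

Counting threshold crossings rather than averaging probabilities makes the decision robust to a few badly pronounced digits, which matches the patent's goal of tolerating pronunciation variation across sessions.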
The embodiments of the present invention extract the loudspeaker latent variable and digit latent variables of the to-be-identified audio information; when the loudspeaker latent variable meets the requirement, the digit latent variables are input into the preset Bayesian model for voiceprint recognition to obtain an identity recognition result. Since the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable of the audio information meets the preset requirement, interference with the recognition result caused by differences in how the loudspeaker itself renders digit pronunciation can be excluded. The speaker's identity is then identified from the digit latent variable of each digit the person under test says; because each digit can have multiple digit latent variables, the speaker's identity can still be identified accurately even if the speaker pronounces the same digit differently at different times. This avoids cases where different loudspeakers render the same digit differently, or where the speaker pronounces the same digit differently at different times, interfering with the identity recognition result, and improves the accuracy of the identity recognition result.
Refer to Fig. 2; Fig. 2 is an implementation flowchart of a method for identifying a speaker provided by another embodiment of the present invention. The execution subject of the method in this embodiment is a terminal. The terminal includes, but is not limited to, mobile terminals such as smartphones, tablet computers, and wearable devices, and may also be a desktop computer or the like. The method for identifying a speaker of this embodiment includes the following steps:
S201: obtaining to-be-identified audio information in which a person under test reads a benchmark numeric string; wherein the benchmark numeric string is stored in advance and broadcast or displayed at random, and the audio information includes the audio corresponding to the numeric string the person under test says.

S201 in this embodiment is identical to S101 in the previous embodiment; for details, refer to the related description of S101 in the previous embodiment, which is not repeated here.
S202: extracting the loudspeaker latent variable and the digit latent variables of the audio information; wherein the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variables are extracted after confirming that the numeric string the person under test says is identical to the benchmark numeric string, and the digit latent variables are used to identify the person under test's pronunciation characteristics for the digits in the audio information.

S202 in this embodiment is identical to S102 in the previous embodiment; for details, refer to the related description of S102 in the previous embodiment, which is not repeated here.
Further, S202 may include S2021 to S2023, as follows:

S2021: extracting the loudspeaker latent variable from the audio information, and detecting, based on the value of the loudspeaker latent variable, whether the audio information is qualified.
Specifically, the terminal can extract the loudspeaker latent variable from the to-be-identified audio information and, based on the value of the extracted loudspeaker latent variable and a preset loudspeaker latent variable threshold, detect whether the audio information is qualified, so as to confirm whether the sound in the audio information can be clearly recognized. The loudspeaker latent variable is used to identify characteristic information of the loudspeaker; it includes, but is not limited to, the signal-to-noise ratio, and may also include the efficiency, the sound pressure level, and so on, of the loudspeaker. The preset loudspeaker latent variable threshold is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information.
The signal-to-noise ratio is a parameter describing the proportion of the effective component to the noise component in a signal: the higher the loudspeaker's signal-to-noise ratio, the clearer the sound it picks up. For example, the terminal extracts the normal voice signal and the no-signal noise from the audio information, and computes the signal-to-noise ratio of the audio information from the two.
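The SNR check described above can be sketched as follows; the function names, the dB formula and the way the speech and noise segments are obtained are our own illustration, not taken from the patent:

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Estimate the signal-to-noise ratio in dB from a speech segment
    and a noise-only segment extracted from the same recording."""
    p_signal = np.mean(speech ** 2)   # average speech power
    p_noise = np.mean(noise ** 2)     # average noise power
    return 10.0 * np.log10(p_signal / p_noise)

def audio_qualified(speech: np.ndarray, noise: np.ndarray,
                    threshold: float = 70.0) -> bool:
    # The embodiment treats the recording as qualified when the
    # SNR value is greater than or equal to 70.
    return snr_db(speech, noise) >= threshold
```

Whether 70 is in dB, and how the noise segment is isolated, are left open by the text; the threshold is therefore exposed as a parameter.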
For example, when the loudspeaker latent variable is the signal-to-noise ratio and the terminal detects that its value is greater than or equal to 70, it determines that the audio information to be identified is qualified and that the digits in it can be clearly recognized.
When the detection result is that the audio information is unqualified, the speaker is prompted to read the random reference numeric string again so that the audio information can be reacquired; alternatively, audio information to be identified is reacquired from a database storing audio data. The reference numeric string is a numeric string that the terminal generates randomly, or fetches randomly from a database, during identification of the speaker's identity, and prompts to the user; before S101 the terminal may play it aloud (with standard pronunciation, when voice broadcast is used) or display it. The reference numeric string contains a preset number of digits, for example 5 or 6.
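A minimal sketch of generating such a random reference numeric string of preset length; the function name and the choice to allow repeated digits are our assumptions:

```python
import random

def make_reference_string(length: int = 6) -> str:
    """Draw a random reference numeric string of the preset length
    (5 or 6 in the embodiment), with digits taken from 0-9."""
    return "".join(random.choice("0123456789") for _ in range(length))
```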
When the detection result is that the audio information is qualified, S2022 is executed.
S2022: when the audio information is qualified, detect whether the numeric string said by the person under test is identical to the reference numeric string, based on the reference numeric string and the audio corresponding to the numeric string said by the person under test.
When the detection result is that the audio information is qualified, the terminal can successively extract the speech segments containing digits from the audio information, identify the numeric string contained in the audio information using speech recognition, and compare the recognized numeric string with the reference numeric string to judge whether the two are identical.
Alternatively, the terminal can play the reference numeric string to obtain its corresponding audio, and compare that audio with the audio corresponding to the numeric string said by the person under test. By comparing the audio obtained by playing the reference numeric string with the collected audio of the person under test, deviations in digit pronunciation caused by the performance of the loudspeaker itself, both when picking up audio and when playing it, can be reduced.
When any digit in the recognized numeric string differs from the corresponding digit in the reference numeric string, or the ordering of the digits differs, the numeric string contained in the audio information is determined to be different from the reference numeric string. At this point the flow may return to S201 to reacquire the voice data to be identified, or an identification result may be output stating that the current speaker does not match the speaker corresponding to the reference numeric string. When every digit in the reference numeric string is identical to the corresponding digit in the recognized numeric string, and the ordering of the digits is also identical, the numeric string contained in the audio information is determined to be identical to the reference numeric string, and S2023 is executed.
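The two-part match condition above (same digit values, same ordering) can be sketched as a small comparison helper; the function name and the returned reason strings are illustrative:

```python
def compare_digit_strings(recognized: str, reference: str):
    """Return (matches, reason). A mismatch occurs when any digit
    value differs, or when the digits appear in a different order."""
    if sorted(recognized) != sorted(reference):
        return False, "digit values differ"
    if recognized != reference:
        return False, "digit ordering differs"
    return True, "identical"
```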
S2023: when the detection result is that the strings are identical, extract the digit latent variable from the audio information.
When the numeric string contained in the audio information is confirmed to be identical to the reference numeric string, the terminal converts the audio information into a matrix of vectors, inputs it into the DNN model for processing, and extracts from the audio information the digit latent variable of each digit in the numeric string. The digit latent variable identifies the pronunciation features of the person under test for the digits in the audio information.
Specifically, the terminal can input the acquired audio information into a pre-trained DNN model and extract, through the deep neural network, the digit latent variable of each digit in the audio information. A digit latent variable identifies the pronunciation features of a given digit; the same digit can have at least two different digit latent variables, that is, at least two different pronunciations. For example, the digit "1" can be pronounced in more than one way.
In this embodiment, the deep neural network model includes an input layer, hidden layers and an output layer. The input layer includes one input node, which receives the audio information input from outside. The hidden layers include two or more hidden-layer nodes, which process the audio information and extract its digit latent variables. The output layer outputs the processing result, i.e. the digit latent variables of the audio information.
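The layer structure just described can be sketched as a tiny forward pass; the layer sizes, the ReLU nonlinearity, the frame pooling and the random stand-in weights are all our assumptions, since the patent does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in: int, n_out: int):
    # Small random weights stand in for trained parameters.
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

class DigitLatentDNN:
    """Input layer -> two hidden layers -> output layer, as in the
    embodiment; the output vector plays the role of the digit latent
    variable extracted from one digit's audio frames."""
    def __init__(self, n_in=40, n_hidden=64, n_latent=16):
        self.w1, self.b1 = layer(n_in, n_hidden)
        self.w2, self.b2 = layer(n_hidden, n_hidden)
        self.w3, self.b3 = layer(n_hidden, n_latent)

    def extract(self, frames: np.ndarray) -> np.ndarray:
        h = relu(frames @ self.w1 + self.b1)
        h = relu(h @ self.w2 + self.b2)
        out = h @ self.w3 + self.b3      # one output vector per frame
        return out.mean(axis=0)          # pool frames into one latent
```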
The deep neural network model is trained on a sound-sample set in which each digit said by a speaker has a corresponding sound sample. There may be multiple speakers, for example 500 or 1500; to some extent, the more training samples there are, the more accurate the results of identification with the trained neural network model. The sound-sample set contains a preset number of sound samples; each sound sample carries a labelled digit latent variable, and the digit latent variables correspond one-to-one with the sample labels. A sound sample may contain only one digit, and each digit in the set may correspond to at least two digit latent variables. The digits covered by the sound samples include every digit that a randomly generated reference numeric string might contain. For example, when the randomly generated reference numeric string contains any 6 of the 10 digits 0 to 9, the sound-sample set contains sound samples for all 10 digits 0 to 9.
During training, the input to the deep neural network model is the audio information, which may be a vector matrix obtained from it: the vectors corresponding to the digit-bearing audio data are extracted from the audio information in sequence and assembled into the matrix. The output of the model is the digit latent variable of each digit in the audio information.
It should be understood that each speaker corresponds to one neural-network recognition model; when multiple speakers need to be identified, the corresponding neural-network recognition models of the multiple speakers are trained.
S203: when the loudspeaker latent variable meets the preset requirement, input the digit latent variable into the preset Bayesian model for processing and obtain the likelihood-ratio score of the audio information. The preset requirement is set according to the loudspeaker latent-variable values observed for clearly recognizable audio information. The Bayesian model is trained, using a machine-learning algorithm, on the digit latent variables of each digit said by a single speaker in the sound-sample set; each digit latent variable carries an identity label identifying the speaker it belongs to, and the Bayesian model corresponds to that single speaker in the sound-sample set.
The terminal inputs the digit latent variable of each digit contained in the audio information into the preset Bayesian model and computes, by the formula p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), the probability corresponding to each digit said by the speaker in the audio information, where p(u_i) = N(u_i | 0, Σ_u) and p(v_j) = N(v_j | 0, Σ_v).

Here p(x_ijk | u_i, v_j, θ) denotes the probability that a person said a digit, and x_ijk denotes that the i-th person said the j-th digit in the k-th session. Because identity verification or information entry may require the speaker to say several different numeric strings, k indexes the session; θ is the collective name for the parameters of the Bayesian model. The conditional probability N(x_ijk | μ + u_i + v_j, Σ_ε) denotes a Gaussian distribution with mean μ + u_i + v_j and variance Σ_ε, where Σ_ε is the diagonal covariance of ε. The signal component is x_ijk = μ + u_i + v_j + ε_ijk, i.e. the signal component x_ijk depends on the speaker and the digit; the noise component ε_ijk denotes the deviation or noise of the i-th person saying the j-th digit in the k-th session, and μ is the overall mean of the training vectors.
u_i denotes the latent variable of speaker i and is defined as a Gaussian with diagonal covariance Σ_u; p(u_i) denotes the probability that the speaker is i, and N(u_i | 0, Σ_u) is the Gaussian distribution with covariance Σ_u. v_j denotes the latent variable of digit j and is defined as a Gaussian with diagonal covariance Σ_v; p(v_j) is the corresponding probability, and N(v_j | 0, Σ_v) is the Gaussian distribution with covariance Σ_v.
Formally, the Bayesian model can be described with the conditional probability N(x | μ, Σ), denoting a Gaussian in x with mean μ and covariance Σ.
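A sketch of evaluating the per-digit Gaussian likelihood p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε) above, working in log space and assuming the stated diagonal covariances; a likelihood-ratio helper matching the 0.6/0.4 example in the text is included. Function names are ours:

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """log N(x | mean, diag(var)) for a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_p_x_given_uv(x_ijk, mu, u_i, v_j, var_eps):
    # p(x_ijk | u_i, v_j, theta) = N(x_ijk | mu + u_i + v_j, Sigma_eps)
    return log_gauss_diag(x_ijk, mu + u_i + v_j, var_eps)

def likelihood_ratio(p_target: float) -> float:
    """Score = P(speaker i said digit j) / P(it was not speaker i)."""
    return p_target / (1.0 - p_target)
```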
Having computed the probability corresponding to each digit said by the speaker in the audio information, the terminal computes the likelihood-ratio score from those probabilities.
The likelihood-ratio score is the ratio of the probability that speaker i said the j-th digit to the probability that it was not speaker i who said it. For example, if the probability that speaker i said the j-th digit is 0.6, the probability that it was not speaker i is 0.4, and the likelihood-ratio score is 0.6/0.4 = 1.5.
It should be understood that the terminal may, for example, compute the probability corresponding to each digit in the audio information over a preset number of iterations; the preset number may be 10, or may be set according to actual needs. In that case the terminal can take the likelihood-ratio scores from the speaker's repeated utterances of the j-th digit, compute their average, and use the average as the speaker's likelihood-ratio score for the j-th digit.
To improve the accuracy of the identification result, the terminal may also take the likelihood-ratio scores from the speaker's repeated utterances of the j-th digit, filter out those greater than or equal to a preset likelihood-ratio score threshold, and use the mean of the filtered scores as the speaker's likelihood-ratio score for the j-th digit. The preset threshold may be 1.2, but is not limited to this and can be set according to the actual situation; no restriction is imposed here.
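The filter-then-average fusion just described can be sketched as follows; the fallback to a plain mean when no score reaches the threshold is our assumption, since the text does not say what happens in that case:

```python
def fused_score(scores, threshold: float = 1.2) -> float:
    """Keep the per-repetition likelihood-ratio scores that reach the
    preset threshold (1.2 in the embodiment) and average them; fall
    back to the plain mean if none survive the filter (assumption)."""
    kept = [s for s in scores if s >= threshold]
    if not kept:
        return sum(scores) / len(scores)
    return sum(kept) / len(kept)
```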
S204: output the identification result of the audio information based on the likelihood-ratio score.
Since the likelihood-ratio score is the ratio of the probability that speaker i said the j-th digit to the probability that it was not speaker i, the terminal can determine that the speaker in the audio information is speaker i when the likelihood-ratio score for the j-th digit is greater than 1, and that the speaker is not speaker i when the score is less than or equal to 1.
Further, in one embodiment, to allow a common likelihood-ratio score threshold to be used for identifying the speaker's identity and thus improve recognition efficiency, S204 may include: normalize the likelihood-ratio score using the formula s' = (s − μ1)/δ1, and output the identification result of the audio information based on the normalized score; where s' is the normalized likelihood-ratio score, s is the likelihood-ratio score, μ1 is the approximate mean of the false-speaker score distribution, and δ1 is its standard deviation.
Using the above formula, the terminal maps likelihood-ratio scores from different speakers into a similar range, so that a common likelihood-ratio score threshold can be used for the decision; μ1 and δ1 are, respectively, the approximate mean and standard deviation of the false-speaker score distribution. The likelihood-ratio score can be normalized by the following three normalization methods:
1) Zero normalization (Z-Norm) computes the average μ1 and standard deviation δ1 using a batch of non-target utterances against the target model; that is, it linearly transforms the score distribution using the estimated mean and variance of the impostor score distribution. For a given speaker probability model, Z-Norm tests a large batch of impostor speech utterances to obtain the impostor score distribution; the mean μ1 and standard deviation δ1 are then computed from that distribution and substituted into s' = (s − μ1)/δ1 to obtain the normalized likelihood-ratio score.
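A minimal sketch of the Z-Norm step: estimate μ1 and δ1 from a batch of impostor scores for this speaker model, then apply s' = (s − μ1)/δ1. Function and variable names are ours:

```python
from statistics import mean, pstdev

def z_norm(score: float, impostor_scores) -> float:
    """Z-Norm: s' = (s - mu1) / delta1, with mu1 and delta1 estimated
    offline from impostor utterance scores against the target model."""
    mu1 = mean(impostor_scores)
    delta1 = pstdev(impostor_scores)   # standard deviation of the batch
    return (score - mu1) / delta1
```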
2) Test normalization (T-Norm) computes the statistics against a group of false-speaker models using the feature vectors of the unknown speaker; normalization is again based on the mean and standard deviation of the impostor score distribution. The difference from Z-Norm is that T-Norm computes the mean and standard deviation using a large number of impostor speaker models rather than impostor voice data. The normalization is carried out at recognition time: the test speech data are compared simultaneously with the claimed speaker model and with a large number of impostor models, impostor scores are obtained, and the impostor score distribution and the normalization parameters μ1 and δ1 are then computed.
3) The mean of the normalized values computed by Z-Norm and T-Norm is used as the final normalized value, forming the score s' after normalization of the likelihood-ratio score s. Z-Norm and T-Norm are prior art and are not described further here.
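The third option, averaging the Z-Norm and T-Norm values of the same raw score, can be sketched as below; both impostor score batches are assumed to be given (impostor utterances against this model for Z-Norm, this utterance against impostor models for T-Norm), and the helper is our own:

```python
from statistics import mean, pstdev

def _norm(score: float, impostor_scores) -> float:
    # s' = (s - mu1) / delta1 over one impostor score distribution
    return (score - mean(impostor_scores)) / pstdev(impostor_scores)

def zt_score(score: float, z_impostors, t_impostors) -> float:
    """Final value: mean of the Z-Norm result and the T-Norm result
    of the same raw likelihood-ratio score."""
    return 0.5 * (_norm(score, z_impostors) + _norm(score, t_impostors))
```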
Further, in another embodiment, to improve the confidence and accuracy of the recognition result and reduce the probability of misjudgment, S204 may include: determine the identification result of the audio information based on the likelihood-ratio score; test whether the identification result is credible using a likelihood-ratio test, and output the identification result when it is verified as credible; when it is verified as not credible, return to S201, or end the process.
Referring also to Fig. 3, Fig. 3 is a schematic diagram of the null hypothesis and the alternative hypothesis provided by an embodiment of the present invention.
Specifically, verification is treated as a hypothesis-testing problem with a null hypothesis H0, under which the audio vectors share the same speaker and digit latent variables u_i and v_j, and an alternative hypothesis H1; i and j are positive integers greater than or equal to 1. Under the null hypothesis H0, one person corresponds to the digits said; under the alternative hypothesis H1, one person corresponds to multiple digits, or one digit corresponds to multiple people, or multiple people and multiple digits are mixed. U1, U2, V1 and V2 denote different speakers.
As shown in Fig. 3, the terminal can verify the result by comparing the likelihood of the data under the different hypotheses. Xt represents that a person said a digit, for example that person i said digit j; Xs represents that this digit was said by a person, for example that digit j was said by person i. ε_t denotes the error of Xt, and ε_s the error of Xs. Under the alternative hypothesis H1, the features Xt and Xs do not match; under the null hypothesis H0, the two judgments agree and the features Xt and Xs match.
When the features Xt and Xs match, the identification result of the audio information is determined to be accurate and credible, and the identification result is output. When Xt and Xs do not match, the identification result of the audio information is determined to be not credible and a misjudgment may exist; the flow then returns to S201, or the speaker-identification process ends.
Further, S205 may follow S204: when the identification result is credible and the identity check passes, respond to the voice-control instruction from the speaker corresponding to the audio information, and execute the preset operation corresponding to that instruction.

On obtaining a credible identification result, the terminal judges, against preset legitimate-identity information, whether the speaker corresponding to the result is a legitimate user; when the speaker is a legitimate user, the check passes. Afterwards, when a voice-control instruction input by the speaker is obtained, the terminal responds to it, obtains the preset operation corresponding to the instruction, and executes it.
For example, when searching for an article, the voice-control instruction is a search instruction; in response to it, the relevant information on the article corresponding to the search instruction is retrieved from a local database or from a network database.
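A minimal sketch of this gated dispatch, responding to a voice-control instruction only after the identity result has been verified; the command names and return strings are purely illustrative:

```python
def handle_voice_command(command: str, identity_ok: bool) -> str:
    """Dispatch a voice-control instruction to its preset operation,
    but only when the identity check has passed."""
    operations = {
        "search": lambda: "query local database, then network database",
        "open":   lambda: "open the requested application",
    }
    if not identity_ok:
        return "rejected: identity not verified"
    op = operations.get(command)
    return op() if op else "unknown command"
```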
In the embodiment of the present invention, the loudspeaker latent variable and the digit latent variable of the audio information to be identified are extracted; when the loudspeaker latent variable meets the preset requirement, the digit latent variable is input into the preset Bayesian model for processing to obtain the likelihood-ratio score of the audio information, and the identification result is output based on that score. Since the preset requirement is set according to the loudspeaker latent-variable values observed for clearly recognizable audio information, interference with the identification result from the loudspeaker's own rendering of the digit pronunciations can be excluded once the loudspeaker latent variable meets the requirement. The speaker's identity is then identified from the digit latent variables of the digits said by the person under test; since each digit can have multiple digit latent variables, the speaker can be identified accurately even when the same digit is pronounced differently at different moments. This avoids interference with the identification result caused by different loudspeakers rendering the same digit differently, or by the speaker pronouncing the same digit differently at different moments, and thereby improves the accuracy of the identification result. Outputting the identification result based on the likelihood-ratio score of the audio information further reduces the probability of misjudgment and improves the accuracy of the recognition result.
It should be understood that the ordinal numbering of the steps in the above embodiments does not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
Referring to Fig. 4, Fig. 4 is a schematic diagram of a terminal provided by an embodiment of the present invention. Each unit included in the terminal is used to execute the steps in the embodiments corresponding to Figs. 1 and 2; refer to the related descriptions in those embodiments. For ease of description, only the parts related to this embodiment are shown. Referring to Fig. 4, the terminal 4 includes:
an acquiring unit 410, configured to acquire the audio information to be identified that the person under test says for the reference numeric string; wherein the reference numeric string is stored in advance and played or displayed at random, and the audio information includes the audio corresponding to the numeric string said by the person under test;
an extraction unit 420, configured to extract the loudspeaker latent variable and the digit latent variable of the audio information; wherein the loudspeaker latent variable identifies the characteristic information of the loudspeaker, and the digit latent variable is extracted when the numeric string said by the person under test is confirmed to be identical to the reference numeric string and identifies the pronunciation features of the person under test for the digits in the audio information;
a recognition unit 430, configured to input the digit latent variable into the preset Bayesian model for voiceprint recognition when the loudspeaker latent variable meets the preset requirement, and obtain the identification result; wherein the preset requirement is set according to the loudspeaker latent-variable values observed for clearly recognizable audio information, the Bayesian model is trained, using a machine-learning algorithm, on the digit latent variables of each digit said by a single speaker in the sound-sample set, each digit latent variable carries an identity label identifying the speaker it belongs to, and the Bayesian model corresponds to that single speaker in the sound-sample set.
Further, the extraction unit 420 includes:

a first detection unit, configured to extract the loudspeaker latent variable from the audio information and detect whether the audio information is qualified based on the value of the loudspeaker latent variable;

a second detection unit, configured to detect, when the audio information is qualified, whether the numeric string said by the person under test is identical to the reference numeric string, based on the reference numeric string and the audio corresponding to the numeric string said by the person under test;

a latent-variable extraction unit, configured to extract the digit latent variable from the audio information when the detection result is that the strings are identical.
Further, the recognition unit 430 includes:

a computing unit, configured to input the digit latent variable into the preset Bayesian model for processing and obtain the likelihood-ratio score of the audio information;

an identity recognition unit, configured to output the identification result of the audio information based on the likelihood-ratio score.
Further, the identity recognition unit is specifically configured to: normalize the likelihood-ratio score using the formula s' = (s − μ1)/δ1, and output the identification result of the audio information based on the normalized likelihood-ratio score; wherein s' is the normalized likelihood-ratio score, s is the likelihood-ratio score, μ1 is the approximate mean of the false-speaker score distribution, and δ1 is the standard deviation of the false-speaker score distribution.
Further, the identity recognition unit is specifically configured to: determine the identification result of the audio information based on the likelihood-ratio score; test whether the identification result is credible using a likelihood-ratio test, and output the identification result when it is verified as credible; wherein, when the identification result is verified as not credible, the process ends, or the acquiring unit 410 executes the acquisition of the audio information to be identified that the person under test says for the reference numeric string.
Fig. 5 is a schematic diagram of a terminal provided by another embodiment of the present invention. As shown in Fig. 5, the terminal 5 of this embodiment includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps in the above embodiments of the speaker-identification method of each terminal, such as S101 to S103 shown in Fig. 1; alternatively, when executing the computer program 52, the processor 50 implements the functions of the units in the above device embodiments, such as the functions of units 310 to 340 shown in Fig. 4.
Illustratively, the computer program 52 can be divided into one or more units, which are stored in the memory 51 and executed by the processor 50 to complete the present invention. The one or more units can be a series of computer-program instruction segments capable of completing specific functions, the instruction segments describing the execution process of the computer program 52 in the terminal 5. For example, the computer program 52 can be divided into an acquiring unit, an extraction unit and a recognition unit, the specific functions of each unit being as described above.
The terminal may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art will understand that Fig. 5 is only an example of the terminal 5 and does not constitute a limitation on the terminal 5; the terminal may include more or fewer components than shown, combine certain components, or use different components; for example, the terminal may also include input/output terminals, network access terminals, a bus, etc.
The processor 50 can be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor can be a microprocessor, or the processor can be any conventional processor.
The memory 51 can be an internal storage unit of the terminal 5, such as a hard disk or memory of the terminal 5. The memory 51 can also be an external storage terminal of the terminal 5, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. equipped on the terminal 5. Further, the memory 51 can include both an internal storage unit and an external storage terminal of the terminal 5. The memory 51 stores the computer program and the other programs and data needed by the terminal, and can also be used to temporarily store data that has been or will be output.
The embodiments described above are merely illustrative of the technical solutions of the present invention and are not limiting. Although the invention has been explained in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions documented in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements, which do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention, shall all be included within the protection scope of the present invention.
Claims (10)
1. A method for identifying a speaker, characterized by comprising:
acquiring the audio information to be identified that a person under test says for a reference numeric string; wherein the reference numeric string is stored in advance and played or displayed at random, and the audio information includes the audio corresponding to the numeric string said by the person under test;
extracting the loudspeaker latent variable and the digit latent variable of the audio information; wherein the loudspeaker latent variable identifies the characteristic information of the loudspeaker, the digit latent variable is extracted when the numeric string said by the person under test is confirmed to be identical to the reference numeric string, and the digit latent variable identifies the pronunciation features of the person under test for the digits in the audio information;
when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain the identification result; wherein the preset requirement is set according to the loudspeaker latent-variable values observed for clearly recognizable audio information; the Bayesian model is trained, using a machine-learning algorithm, on the digit latent variables of each digit said by a single speaker in a sound-sample set, each digit latent variable carries an identity label identifying the speaker it belongs to, and the Bayesian model corresponds to the single speaker in the sound-sample set.
2. The method according to claim 1, wherein extracting the loudspeaker latent variable and the digit latent variable of the audio information comprises:
extracting the loudspeaker latent variable from the audio information, and detecting whether the audio information is qualified based on the value of the loudspeaker latent variable;
when the audio information is qualified, detecting whether the numeric string said by the person under test is identical to the reference numeric string, based on the reference numeric string and the audio corresponding to the numeric string said by the person under test;
when the detection result is that the strings are identical, extracting the digit latent variable from the audio information.
3. The method according to claim 1 or 2, wherein when the speaker latent variable meets the preset requirement, inputting the digit latent variable into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result comprises:
inputting the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
outputting the identity recognition result of the audio information based on the likelihood ratio score.
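A minimal sketch of the scoring step of claim 3, under assumed model details the claim leaves open: each per-speaker Bayesian model is reduced here to one Gaussian per digit, and the likelihood ratio is taken against a background model.

```python
import math

# Sketch of claim 3's scoring step under an assumed model form (one
# Gaussian per digit, ratio against a background model); the patent
# does not fix these details, they are illustrative choices.

def gaussian_logpdf(x, mean, std):
    # Log density of a univariate normal distribution.
    return (-0.5 * math.log(2 * math.pi * std * std)
            - (x - mean) ** 2 / (2 * std * std))

def likelihood_ratio_score(digit_latents, speaker_model, background_model):
    # digit_latents: {digit: latent value}
    # speaker_model / background_model: {digit: (mean, std)}
    score = 0.0
    for digit, x in digit_latents.items():
        score += gaussian_logpdf(x, *speaker_model[digit])
        score -= gaussian_logpdf(x, *background_model[digit])
    return score  # > 0 favours the claimed speaker over the background
```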
4. The method according to claim 3, wherein outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
normalizing the likelihood ratio score using the formula s' = (s - μ1)/δ1, and outputting the identity recognition result of the audio information based on the normalized likelihood ratio score; wherein s' is the normalized likelihood ratio score, s is the likelihood ratio score, μ1 is the approximate mean of the impostor speaker score distribution, and δ1 is the standard deviation of the impostor speaker score distribution.
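Claim 4's formula s' = (s - μ1)/δ1 is a zero-normalization of the raw score against the impostor score distribution. Below is a direct transcription, plus a hypothetical helper `impostor_stats` for estimating μ1 and δ1 from held-out impostor trial scores (the claim does not say how those statistics are obtained).

```python
import statistics

# Zero-normalization of claim 4: s' = (s - mu_1) / delta_1, where mu_1
# and delta_1 describe the impostor score distribution. Estimating them
# from held-out impostor scores is an assumption, not stated in the claim.

def impostor_stats(impostor_scores):
    # Approximate mean and standard deviation of the impostor distribution.
    return statistics.mean(impostor_scores), statistics.pstdev(impostor_scores)

def normalize_score(s, impostor_mean, impostor_std):
    # s' = (s - mu_1) / delta_1
    return (s - impostor_mean) / impostor_std
```

Normalizing against the impostor distribution puts scores from different speaker models on a comparable scale, so a single decision threshold can be applied.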
5. The method according to claim 3, wherein outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
determining the identity recognition result of the audio information based on the likelihood ratio score;
checking whether the identity recognition result is credible using a likelihood ratio test, and outputting the identity recognition result when it is verified to be credible; wherein, when the identity recognition result is verified to be not credible, returning to the step of obtaining the to-be-recognized audio information spoken by the test subject for the reference digit string, or terminating.
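The verify-or-retry control flow of claim 5 might look like the following sketch. The accept threshold, credibility margin, and attempt limit are assumptions; the claim states only that a non-credible result triggers re-acquisition of audio or termination.

```python
# Hypothetical sketch of claim 5's verify-or-retry flow.

ACCEPT_THRESHOLD = 0.0  # decision boundary on the normalized score
CREDIBLE_MARGIN = 1.0   # assumed margin a score must clear to be credible

def decide(score):
    return "accept" if score >= ACCEPT_THRESHOLD else "reject"

def is_credible(score):
    # A simple likelihood-ratio-style check: scores too close to the
    # decision boundary are treated as not credible.
    return abs(score - ACCEPT_THRESHOLD) >= CREDIBLE_MARGIN

def recognize(score_stream, max_attempts=3):
    # score_stream yields one normalized score per captured utterance;
    # each iteration corresponds to re-acquiring audio from the subject.
    for _, score in zip(range(max_attempts), score_stream):
        if is_credible(score):
            return decide(score)  # credible: output the recognition result
    return None                   # never credible: terminate
```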
6. A terminal, comprising:
an obtaining unit, configured to obtain to-be-recognized audio information spoken by a test subject for a reference digit string; wherein the reference digit string is pre-stored and shuffled or randomly displayed, and the audio information comprises the audio corresponding to the digit string spoken by the test subject;
an extraction unit, configured to extract a speaker latent variable and a digit latent variable of the audio information; wherein the speaker latent variable is used to characterize feature information of the speaker, the digit latent variable is extracted when the digit string spoken by the test subject is confirmed to be identical to the reference digit string, and the digit latent variable is used to characterize the test subject's pronunciation of the digits in the audio information;
a recognition unit, configured to, when the speaker latent variable meets a preset requirement, input the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a voice sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the voice sample set.
7. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
obtaining to-be-recognized audio information spoken by a test subject for a reference digit string; wherein the reference digit string is pre-stored and shuffled or randomly displayed, and the audio information comprises the audio corresponding to the digit string spoken by the test subject;
extracting a speaker latent variable and a digit latent variable of the audio information; wherein the speaker latent variable is used to characterize feature information of the speaker, the digit latent variable is extracted when the digit string spoken by the test subject is confirmed to be identical to the reference digit string, and the digit latent variable is used to characterize the test subject's pronunciation of the digits in the audio information;
when the speaker latent variable meets a preset requirement, inputting the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; wherein the preset requirement is configured based on the values of the speaker latent variable that correspond to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a voice sample set; each digit latent variable carries an identity label identifying the speaker to whom that digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the voice sample set.
8. The terminal according to claim 7, wherein extracting the speaker latent variable and the digit latent variable of the audio information comprises:
extracting the speaker latent variable from the audio information, and detecting, based on the value of the speaker latent variable, whether the audio information is qualified;
when the audio information is qualified, detecting, based on the reference digit string and the audio corresponding to the digit string spoken by the test subject, whether the digit string spoken by the test subject is identical to the reference digit string;
when the detection result is that they are identical, extracting the digit latent variable from the audio information.
9. The terminal according to claim 8, wherein when the speaker latent variable meets the preset requirement, inputting the digit latent variable into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result comprises:
inputting the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
outputting the identity recognition result of the audio information based on the likelihood ratio score.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354414.7A CN110111798B (en) | 2019-04-29 | 2019-04-29 | Method, terminal and computer readable storage medium for identifying speaker |
PCT/CN2019/103299 WO2020220541A1 (en) | 2019-04-29 | 2019-08-29 | Speaker recognition method and terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354414.7A CN110111798B (en) | 2019-04-29 | 2019-04-29 | Method, terminal and computer readable storage medium for identifying speaker |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110111798A (en) | 2019-08-09 |
CN110111798B CN110111798B (en) | 2023-05-05 |
Family
ID=67487460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910354414.7A Active CN110111798B (en) | 2019-04-29 | 2019-04-29 | Method, terminal and computer readable storage medium for identifying speaker |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110111798B (en) |
WO (1) | WO2020220541A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110503956A (en) * | 2019-09-17 | 2019-11-26 | 平安科技(深圳)有限公司 | Audio recognition method, device, medium and electronic equipment |
CN111768789A (en) * | 2020-08-03 | 2020-10-13 | 上海依图信息技术有限公司 | Electronic equipment and method, device and medium for determining identity of voice sender thereof |
WO2020220541A1 (en) * | 2019-04-29 | 2020-11-05 | 平安科技(深圳)有限公司 | Speaker recognition method and terminal |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112820297A (en) * | 2020-12-30 | 2021-05-18 | 平安普惠企业管理有限公司 | Voiceprint recognition method and device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150235637A1 (en) * | 2014-02-14 | 2015-08-20 | Google Inc. | Recognizing speech in the presence of additional audio |
CN106448685A (en) * | 2016-10-09 | 2017-02-22 | 北京远鉴科技有限公司 | System and method for identifying voice prints based on phoneme information |
CN106531171A (en) * | 2016-10-13 | 2017-03-22 | 普强信息技术(北京)有限公司 | Method for realizing dynamic voiceprint password system |
CN107104803A (en) * | 2017-03-31 | 2017-08-29 | 清华大学 | It is a kind of to combine the user ID authentication method confirmed with vocal print based on numerical password |
CN109256138A (en) * | 2018-08-13 | 2019-01-22 | 平安科技(深圳)有限公司 | Auth method, terminal device and computer readable storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5982297B2 (en) * | 2013-02-18 | 2016-08-31 | 日本電信電話株式会社 | Speech recognition device, acoustic model learning device, method and program thereof |
US20190087734A1 (en) * | 2016-03-28 | 2019-03-21 | Sony Corporation | Information processing apparatus and information processing method |
JP2018013722A (en) * | 2016-07-22 | 2018-01-25 | 国立研究開発法人情報通信研究機構 | Acoustic model optimization device and computer program therefor |
KR101843074B1 (en) * | 2016-10-07 | 2018-03-28 | 서울대학교산학협력단 | Speaker recognition feature extraction method and system using variational auto encoder |
US9911413B1 (en) * | 2016-12-28 | 2018-03-06 | Amazon Technologies, Inc. | Neural latent variable model for spoken language understanding |
CN109166586B (en) * | 2018-08-02 | 2023-07-07 | 平安科技(深圳)有限公司 | Speaker identification method and terminal |
CN110111798B (en) * | 2019-04-29 | 2023-05-05 | 平安科技(深圳)有限公司 | Method, terminal and computer readable storage medium for identifying speaker |
2019
- 2019-04-29: CN application CN201910354414.7A granted as patent CN110111798B (en), status active
- 2019-08-29: WO application PCT/CN2019/103299 published as WO2020220541A1 (en), application filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020220541A1 (en) * | 2019-04-29 | 2020-11-05 | 平安科技(深圳)有限公司 | Speaker recognition method and terminal |
CN110503956A (en) * | 2019-09-17 | 2019-11-26 | 平安科技(深圳)有限公司 | Audio recognition method, device, medium and electronic equipment |
CN110503956B (en) * | 2019-09-17 | 2023-05-12 | 平安科技(深圳)有限公司 | Voice recognition method, device, medium and electronic equipment |
CN111768789A (en) * | 2020-08-03 | 2020-10-13 | 上海依图信息技术有限公司 | Electronic equipment and method, device and medium for determining identity of voice sender thereof |
CN111768789B (en) * | 2020-08-03 | 2024-02-23 | 上海依图信息技术有限公司 | Electronic equipment, and method, device and medium for determining identity of voice generator of electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
WO2020220541A1 (en) | 2020-11-05 |
CN110111798B (en) | 2023-05-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features | |
CN107104803B (en) | User identity authentication method based on digital password and voiceprint joint confirmation | |
CN111712874B (en) | Method, system, device and storage medium for determining sound characteristics | |
CN109166586B (en) | Speaker identification method and terminal | |
CN110111798A (en) | Method and terminal for recognizing a speaker | |
Dey et al. | Speech biometric based attendance system | |
CN107924682A (en) | Neural networks for speaker verification | |
US6205424B1 (en) | Two-staged cohort selection for speaker verification system | |
CN107610707A (en) | Voiceprint recognition method and device | |
US10630680B2 (en) | System and method for optimizing matched voice biometric passphrases | |
EP4233047A1 (en) | Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium | |
TW202213326A (en) | Generalized negative log-likelihood loss for speaker verification | |
CN115457938A (en) | Method, device, storage medium and electronic device for recognizing wake-up words | |
CN110364168A (en) | Environment-aware voiceprint recognition method and system | |
US6499012B1 (en) | Method and apparatus for hierarchical training of speech models for use in speaker verification | |
Bui et al. | A non-linear GMM KL and GUMI kernel for SVM using GMM-UBM supervector in home acoustic event classification | |
JPWO2020003413A1 (en) | Information processing equipment, control methods, and programs | |
Mandalapu et al. | Multilingual voice impersonation dataset and evaluation | |
CN110931020B (en) | Voice detection method and device | |
Hong et al. | Generalization ability improvement of speaker representation and anti-interference for speaker verification | |
CN108694950B (en) | Speaker confirmation method based on deep hybrid model | |
CN112133291A (en) | Language identification model training, language identification method and related device | |
Lotia et al. | A review of various score normalization techniques for speaker identification system | |
Mohamed et al. | An Overview of the Development of Speaker Recognition Techniques for Various Applications. | |
Yang et al. | User verification based on customized sentence reading |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||