WO2020220541A1 - Method and terminal for identifying a speaker - Google Patents

Method and terminal for identifying a speaker

Info

Publication number
WO2020220541A1
WO2020220541A1 (PCT/CN2019/103299)
Authority
WO
WIPO (PCT)
Prior art keywords
latent variable
audio information
speaker
digital
likelihood ratio
Prior art date
Application number
PCT/CN2019/103299
Other languages
English (en)
French (fr)
Inventor
张丝潆
曾庆亮
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020220541A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/12 - Score normalisation
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Definitions

  • This application belongs to the field of computer technology, and in particular relates to a method and terminal for identifying a speaker.
  • A voiceprint is an information map of a speaker's speech spectrum. Because every person's vocal organs differ, the sounds they produce and their pitches differ as well, so using the voiceprint as a basic feature for identity recognition is stable and hard to substitute. There are two types of voiceprint recognition: text-dependent and text-independent.
  • In contrast to text-independent speaker recognition, in which the speech content is unconstrained, text-dependent speaker verification systems are better suited to security applications because they tend to show higher accuracy in short sessions.
  • Typical text-dependent speaker recognition has each user use a fixed phrase to match the enrollment and test phrases. In this case, an utterance from the user can be recorded in advance and then replayed. When the training and test utterances occur in different scenarios, sharing the same speech content can improve the security of the recognition protection to a certain extent.
  • The system therefore randomly presents number strings, and the user must correctly read out the corresponding content before the voiceprint is recognized. The introduction of this randomness means that every voiceprint collected in text-dependent recognition differs in its content sequence.
  • However, when a randomly prompted number string is used to identify a speaker, the digit vocabulary of some samples is fixed and the digit samples are limited, and different loudspeakers pronounce the same digit differently, which may lead to inaccurate identification of the speaker's identity.
  • The embodiments of the present application provide a method and terminal for identifying a speaker, so as to solve the problem in the prior art that, for complex voiceprint speech information (for example, short utterances or imitated speech), a text-independent voiceprint recognition system cannot accurately extract the speaker's voice features and therefore cannot accurately identify the speaker's identity.
  • the first aspect of the embodiments of the present application provides a method for identifying a speaker, including:
  • acquiring to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
  • extracting a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information; and
  • when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
  • the second aspect of the embodiments of the present application provides a terminal, including:
  • an acquiring unit, configured to acquire to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
  • an extraction unit, configured to extract a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information; and
  • a recognition unit, configured to input the digital latent variables into a preset Bayesian model for voiceprint recognition when the loudspeaker latent variable meets a preset requirement, to obtain an identity recognition result, where the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set, each digital latent variable carries an identity tag identifying the speaker to which it belongs, and the Bayesian model corresponds to the single speaker in the sound sample set.
  • A third aspect of the embodiments of the present application provides a terminal, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • acquiring to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
  • extracting a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information; and
  • when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
  • A fourth aspect of the embodiments of the present application provides a non-volatile computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the following steps:
  • acquiring to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
  • extracting a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information; and
  • when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
  • In the embodiments of the present application, the loudspeaker latent variable and the digital latent variables of the to-be-recognized audio information are extracted; when the loudspeaker latent variable meets the requirement, the digital latent variables are input into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result. Because the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable of the audio information meets the preset requirement, interference with the identity recognition result caused by the performance of the loudspeaker itself on digit pronunciation can be ruled out. The speaker's identity information is then recognized from the digital latent variable of each digit spoken by the test subject. Since each digit can have several digital latent variables, the speaker's identity can be recognized accurately even if the speaker pronounces the same digit differently at different times, which avoids interference with the identity recognition result from different loudspeakers pronouncing the same digit differently or from the speaker pronouncing the same digit differently at different times, and improves the accuracy of the identity recognition result.
  • FIG. 1 is a flowchart of an implementation of a method for identifying a speaker according to an embodiment of the present application;
  • FIG. 2 is an implementation flowchart of a method for identifying a speaker provided by another embodiment of the present application
  • FIG. 3 is a schematic diagram of the null hypothesis and the alternative hypothesis provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a terminal provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a terminal according to another embodiment of the present application.
  • Referring to FIG. 1, FIG. 1 is a flowchart of an implementation of a method for identifying a speaker according to an embodiment of the present application.
  • the execution subject of the speaker identification method in this embodiment is the terminal.
  • Terminals include, but are not limited to, mobile terminals such as smart phones, tablet computers, and wearable devices, and may also be desktop computers.
  • The method for identifying a speaker shown in the figure may include the following steps:
  • S101: Acquire to-be-identified audio information spoken by the test subject for a reference number string; the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject.
  • When the terminal detects a speaker recognition instruction, it can acquire the to-be-recognized audio information uttered by a speaker in the surrounding environment through a built-in sound pickup device (for example, a microphone or loudspeaker); in this case, the audio information uttered by the speaker follows the reference number string randomly issued by the terminal. Alternatively, the terminal acquires the audio file or video file corresponding to the file identifier contained in the speaker recognition instruction, extracts the audio information from that audio or video file, and treats it as the audio information to be recognized.
  • The audio file or video file contains the audio information obtained when the test subject read out the reference number string.
  • the audio file or video file can be uploaded by the user or downloaded from the server used to store the audio file or video file.
  • the reference number string is pre-stored in the terminal and randomly played or displayed by the terminal.
  • There may be multiple reference number strings.
  • The to-be-recognized audio information includes the audio corresponding to a number string, and the number string consists of at least one digit.
  • S102: Extract the loudspeaker latent variable and the digital latent variables of the audio information; the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information.
  • The terminal calculates the loudspeaker latent variable of the audio information based on the acquired audio information.
  • The loudspeaker latent variable includes, but is not limited to, the signal-to-noise ratio, and can also include the loudspeaker's efficiency, sound pressure level, and so on.
  • The signal-to-noise ratio (SNR) is a parameter describing the ratio of the effective component to the noise component in a signal; the higher the loudspeaker's SNR, the clearer the sound it picks up.
  • For example, the terminal extracts from the audio information the normal speech signal and the noise signal present when there is no speech, and computes the SNR of the audio information from these two signals.
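A rough, non-authoritative sketch of the SNR check described above is given below; the 10·log10 power-ratio formula, the segment boundaries, and the synthetic signal values are assumptions for illustration, not part of the application.

```python
import numpy as np

def estimate_snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Estimate the signal-to-noise ratio in decibels from a segment of
    normal speech and a segment recorded while no one is speaking."""
    signal_power = np.mean(speech.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2)
    if noise_power == 0.0:
        return float("inf")  # perfectly clean recording
    return 10.0 * np.log10(signal_power / noise_power)

# Synthetic example: a sine-wave "speech" segment plus low-level noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000)
speech = np.sin(2 * np.pi * 220 * t) + 0.01 * rng.standard_normal(t.size)
noise = 0.01 * rng.standard_normal(t.size)
print(f"estimated SNR: {estimate_snr_db(speech, noise):.1f} dB")
```

A qualified recording would then be one whose estimated SNR clears the preset threshold (for example, the value 70 used later in this description).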
  • The terminal can input the acquired audio information into a pre-trained deep neural network (DNN) model and extract the digital latent variable of each digit in the audio information through the deep neural network.
  • A digital latent variable identifies a pronunciation characteristic of a digit.
  • The same digit can have at least two different digital latent variables, that is, at least two different pronunciations; for example, the digit "1" can be pronounced "yī" or "yāo".
  • In this embodiment, the deep neural network model includes an input layer, a hidden layer, and an output layer.
  • The input layer includes one input-layer node for receiving the input audio information from outside.
  • The hidden layer includes two or more hidden-layer nodes, which process the audio information and extract its digital latent variables.
  • The output layer outputs the processing result, that is, the digital latent variables of the audio information.
  • The deep neural network model is trained on a sound sample set, which includes the sound samples corresponding to each digit spoken by the speakers.
  • There can be multiple speakers, for example 500 or 1,500. To a certain extent, the more samples used in training, the more accurate the results when the trained neural network model is used for recognition.
  • The sound sample set includes a preset number of sound samples; each sound sample has a labeled digital latent variable, and each digital latent variable corresponds one-to-one with a sample label.
  • A sound sample can contain only one digit, and each digit in the sound sample set can correspond to at least two digital latent variables.
  • The digits covered by the sound samples include all digits that any randomly generated reference number string may contain. For example, when the randomly generated reference number string consists of any 6 of the 10 digits 0-9, the sound sample set includes sound samples for all 10 digits 0-9.
  • During training, the input data of the deep neural network model is the audio information, which can be a vector matrix derived from the audio information; the vector matrix is composed of the vectors corresponding to the digit-bearing audio data extracted from the audio information in sequence.
  • The output data of the deep neural network model is the digital latent variable of each digit in the audio information.
  • It should be understood that one speaker corresponds to one neural network recognition model; when multiple speakers need to be recognized, the neural network recognition model corresponding to each speaker is trained.
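For illustration only, the following sketch mirrors the network shape described above: one input layer, two or more hidden layers, and an output layer that emits one digital latent variable per digit segment. The feature dimension, hidden sizes, 64-dimensional latent, and random (untrained) weights are all assumptions; an actual model would be trained on the labeled sound sample set.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x)

class DigitLatentDNN:
    """Feed-forward sketch: input layer -> hidden layers -> linear output
    layer whose activations serve as the digital latent variable."""

    def __init__(self, n_features: int = 40, hidden=(256, 256), latent_dim: int = 64):
        dims = [n_features, *hidden, latent_dim]
        self.weights = [rng.standard_normal((a, b)) * np.sqrt(2.0 / a)
                        for a, b in zip(dims[:-1], dims[1:])]
        self.biases = [np.zeros(b) for b in dims[1:]]

    def extract(self, features: np.ndarray) -> np.ndarray:
        """Map one digit's acoustic feature vector to its digital latent variable."""
        h = features
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = relu(h @ w + b)          # hidden layers process the audio features
        return h @ self.weights[-1] + self.biases[-1]  # output layer: the latent

model = DigitLatentDNN()
segment_features = rng.standard_normal(40)  # placeholder features for one digit
latent = model.extract(segment_features)
print(latent.shape)  # (64,): one digital latent variable
```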
  • When the terminal has extracted the loudspeaker latent variable, it judges, based on a preset condition, whether the extracted loudspeaker latent variable meets the preset requirement.
  • The preset condition is used to determine whether the sound in the audio information is clear enough to be recognized, and can be set based on the value of the loudspeaker latent variable corresponding to a clearly recognizable sound.
  • For example, when the loudspeaker latent variable is the signal-to-noise ratio, the preset condition can be that the SNR is greater than or equal to a preset SNR threshold.
  • The preset SNR threshold is the SNR corresponding to a clearly recognizable sound.
  • S103: When the loudspeaker latent variable meets a preset requirement, input the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result; the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in the sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
  • For example, when a speaker utters a digit with a given digital latent variable, the identity tag of that digital latent variable indicates the identity of the speaker who said the digit.
  • In this embodiment, the preset Bayesian model is expressed as follows:
  • p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), where p(u_i) = N(u_i | 0, Σ_u) and p(v_j) = N(v_j | 0, Σ_v).
  • Here, p(x_ijk | u_i, v_j, θ) is the probability that a person said a digit, and x_ijk denotes the i-th person saying the j-th digit in the k-th session. Because a speaker may be asked to say several different number strings during identity verification or enrollment, k indexes the session, and θ is the collective name for the parameters of the Bayesian model.
  • The conditional probability N(x_ijk | μ + u_i + v_j, Σ_ε) denotes a Gaussian distribution with mean μ + u_i + v_j and covariance Σ_ε, where Σ_ε is the diagonal covariance of ε. The signal component is x_ijk = μ + u_i + v_j + ε_ijk; that is, the signal component depends on the speaker and the digit. The noise component ε_ijk is the deviation, or noise, of the i-th person saying the j-th digit in the k-th session, and μ is the overall mean of the training vectors.
  • u_i is the latent variable of speaker i, defined as a Gaussian with diagonal covariance Σ_u; p(u_i) is the probability that the speaker is i, and N(u_i | 0, Σ_u) is a Gaussian distribution with covariance Σ_u. v_j is the digital latent variable of digit j, defined as a Gaussian with diagonal covariance Σ_v; p(v_j) is the probability for digit j, and N(v_j | 0, Σ_v) is a Gaussian distribution with covariance Σ_v.
  • Formally, the Bayesian model can be described with conditional probabilities: N(x | μ, Σ) denotes a Gaussian in x with mean μ and covariance Σ.
  • When the terminal completes the iterative calculation of the probability of the speaker having said each digit contained in the audio information, it identifies the speaker based on the probabilities obtained in each iteration, to obtain the identity recognition result.
  • For example, with a total of 10 iterations: when the probability that speaker i said a digit is greater than or equal to a preset probability threshold (for example, 0.8), 1 point is scored; when the probability that speaker i said a digit is less than the preset probability threshold (for example, 0.8), 0 points are scored. The total score of speaker i over the 10 iterations is counted, and when the total score is greater than or equal to a preset score threshold (for example, 7 points), the speaker of the audio information is determined to be speaker i.
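A minimal sketch of this scoring loop, assuming the diagonal-covariance Gaussian of the model above. The latent dimensionality, the parameter values, the synthetic session vectors, and the normalization of the Gaussian density into a bounded pseudo-probability are all assumptions, made so that the 0.8 and 7-point thresholds from the example can be applied.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # assumed latent dimensionality

def gaussian_logpdf_diag(x, mean, var):
    """log N(x | mean, diag(var)) for a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

mu = np.zeros(D)                       # overall mean of the training vectors
sigma_eps = np.full(D, 0.5)            # diagonal of the noise covariance (assumed)
u_i = 0.3 * rng.standard_normal(D)     # latent variable of speaker i
v_j = 0.3 * rng.standard_normal(D)     # digital latent variable of digit j

score = 0
for k in range(10):  # ten iterations/sessions, as in the example above
    # Synthetic observation for session k, kept close to the model mean so
    # the genuine speaker is accepted in this toy run.
    x_ijk = mu + u_i + v_j + 0.1 * rng.standard_normal(D)
    # Bounded pseudo-probability in (0, 1]: density at x relative to the peak
    # density (an assumption; the application does not specify this step).
    log_p = gaussian_logpdf_diag(x_ijk, mu + u_i + v_j, sigma_eps)
    log_peak = gaussian_logpdf_diag(mu + u_i + v_j, mu + u_i + v_j, sigma_eps)
    p = float(np.exp(log_p - log_peak))
    score += 1 if p >= 0.8 else 0      # 1 point when the probability >= 0.8
print("accept as speaker i" if score >= 7 else "reject")  # 7-point threshold
```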
  • In the embodiments of the present application, the loudspeaker latent variable and the digital latent variables of the to-be-recognized audio information are extracted; when the loudspeaker latent variable meets the requirement, the digital latent variables are input into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result. Because the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable in the audio information meets the preset requirement, interference with the identity recognition result caused by differences in how the loudspeaker itself renders digit pronunciations can be ruled out. The speaker's identity information is then recognized from the digital latent variable of each digit spoken by the test subject; since each digit can have several digital latent variables, the speaker's identity can be recognized accurately even if the speaker pronounces the same digit differently at different times.
  • FIG. 2 is a flowchart of an implementation of a method for identifying a speaker according to another embodiment of the present application.
  • the execution subject of the speaker identification method in this embodiment is the terminal.
  • Terminals include, but are not limited to, mobile terminals such as smart phones, tablet computers, and wearable devices, and may also be desktop computers.
  • the speaker identification method of this embodiment includes the following steps:
  • S201: Acquire to-be-identified audio information spoken by the test subject for a reference number string; the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject.
  • S201 in this embodiment is the same as S101 in the previous embodiment; for details, refer to the description of S101 in the previous embodiment, which is not repeated here.
  • S202: Extract the loudspeaker latent variable and the digital latent variables of the audio information; the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information.
  • S202 in this embodiment is the same as S102 in the previous embodiment; for details, refer to the description of S102 in the previous embodiment, which is not repeated here.
  • Further, S202 may include S2021 to S2023, detailed as follows:
  • S2021: Extract the loudspeaker latent variable from the audio information, and detect whether the audio information is qualified based on the value of the loudspeaker latent variable.
  • Specifically, the terminal can extract the loudspeaker latent variable from the to-be-recognized audio information and, based on the extracted value and a preset loudspeaker latent variable threshold, detect whether the audio information is qualified, so as to confirm whether the sound in the audio information is clear enough to be recognized.
  • The loudspeaker latent variable is used to identify characteristic information of the loudspeaker.
  • The loudspeaker latent variable includes, but is not limited to, the signal-to-noise ratio, and can also include the loudspeaker's efficiency, sound pressure level, and so on.
  • The preset loudspeaker latent variable threshold is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information.
  • The signal-to-noise ratio is a parameter describing the ratio of the effective component to the noise component in a signal; the higher the loudspeaker's SNR, the clearer the sound it picks up.
  • For example, the terminal extracts from the audio information the normal speech signal and the noise signal present when there is no speech, and computes the SNR of the audio information from these two signals.
  • For example, when the loudspeaker latent variable is the signal-to-noise ratio and the terminal detects that the SNR value is greater than or equal to 70, it determines that the to-be-recognized audio information is qualified and that the digits in the audio information can be clearly recognized.
  • When the detection result is that the audio information is not qualified, the speaker is prompted to re-read the random reference number string so that the audio information can be re-acquired; alternatively, the to-be-recognized audio information is re-acquired from the database storing the corresponding audio data.
  • The reference number string is a number string that the terminal randomly generates, or randomly retrieves from the database, during speaker identification, and that is prompted to the user.
  • The terminal can randomly play or display the reference number string before S101.
  • When the reference number string is played by voice broadcast, it is played with standard pronunciation.
  • The reference number string includes a preset number of digits; for example, 5 or 6 digits.
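A trivial sketch of how such a reference number string might be generated before being played or displayed; the use of Python's random module and the helper name are assumptions.

```python
import random

def make_reference_string(length: int = 6) -> str:
    """Draw a random reference number string of the preset length
    (5 or 6 digits in this description) from the digits 0-9."""
    return "".join(random.choice("0123456789") for _ in range(length))

print("Please read out:", make_reference_string())  # e.g. "582013"
```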
  • When the detection result is that the audio information is qualified, the terminal can sequentially extract the speech segments containing digits from the audio information, use speech recognition technology to recognize the number string contained in the audio information, and compare the reference number string with the recognized number string to determine whether the number string contained in the audio information is the same as the reference number string.
  • Alternatively, the terminal can play the reference number string to obtain its corresponding audio, compare that audio with the audio corresponding to the number string spoken by the test subject, and thereby detect whether the number string spoken by the test subject is the same as the reference number string.
  • By comparing the audio obtained from playing the reference number string against the captured audio of the number string spoken by the test subject, the terminal can reduce the deviation in digit pronunciation caused by the performance of the loudspeaker itself when picking up and playing audio.
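Either comparison path ends in an exact test that every digit and the digit order match. A minimal sketch, assuming a hypothetical ASR front end has already produced the recognized string:

```python
def strings_match(recognized: str, reference: str) -> bool:
    """Qualify the utterance only if every digit and the digit order
    both match the reference number string exactly."""
    return recognized == reference

print(strings_match("582013", "582013"))  # True: proceed to S2023
print(strings_match("582031", "582013"))  # False: order differs, mismatch
```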
  • When the terminal confirms that the number string contained in the audio information is the same as the reference number string, it converts the audio information into a matrix of vectors, inputs it into the DNN model for processing, and extracts from the audio information the digital latent variable of each digit in the number string.
  • The digital latent variable is used to identify the test subject's pronunciation characteristics of the digits in the audio information.
  • the terminal may input the acquired audio information into a pre-trained DNN model, and extract the digital latent variable of each number in the audio information through a deep neural network.
  • A digital latent variable identifies a pronunciation characteristic of a digit.
  • the same number can have at least two different digital latent variables, that is, at least two different pronunciations.
  • For example, the digit "1" can be pronounced "yī" or "yāo".
  • In this embodiment, the deep neural network model includes an input layer, a hidden layer, and an output layer.
  • The input layer includes one input-layer node for receiving the input audio information from outside.
  • The hidden layer includes two or more hidden-layer nodes, which process the audio information and extract its digital latent variables.
  • The output layer outputs the processing result, that is, the digital latent variables of the audio information.
  • The deep neural network model is trained on a sound sample set, which includes the sound samples corresponding to each digit spoken by the speakers.
  • There can be multiple speakers, for example 500 or 1,500. To a certain extent, the more samples used in training, the more accurate the results when the trained neural network model is used for recognition.
  • The sound sample set includes a preset number of sound samples; each sound sample has a labeled digital latent variable, and each digital latent variable corresponds one-to-one with a sample label.
  • A sound sample can contain only one digit, and each digit in the sound sample set can correspond to at least two digital latent variables.
  • The digits covered by the sound samples include all digits that any randomly generated reference number string may contain. For example, when the randomly generated reference number string consists of any 6 of the 10 digits 0-9, the sound sample set includes sound samples for all 10 digits 0-9.
  • During training, the input data of the deep neural network model is the audio information, which can be a vector matrix derived from the audio information; the vector matrix is composed of the vectors corresponding to the digit-bearing audio data extracted from the audio information in sequence.
  • The output data of the deep neural network model is the digital latent variable of each digit in the audio information.
  • It should be understood that one speaker corresponds to one neural network recognition model; when multiple speakers need to be recognized, the neural network recognition model corresponding to each speaker is trained.
  • S203: When the loudspeaker latent variable meets a preset requirement, input the digital latent variables into a preset Bayesian model for processing to obtain the likelihood ratio score of the audio information; the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in the sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
  • The terminal inputs the digital latent variable of each digit contained in the audio information into the preset Bayesian model, and computes the probability that the speaker said each digit contained in the audio information through the formula p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), where p(u_i) = N(u_i | 0, Σ_u) and p(v_j) = N(v_j | 0, Σ_v).
  • Here, p(x_ijk | u_i, v_j, θ) is the probability that a person said a digit, and x_ijk denotes the i-th person saying the j-th digit in the k-th session. Because a speaker may be asked to say several different number strings during identity verification or enrollment, k indexes the session, and θ is the collective name for the parameters of the Bayesian model.
  • The conditional probability N(x_ijk | μ + u_i + v_j, Σ_ε) denotes a Gaussian distribution with mean μ + u_i + v_j and covariance Σ_ε, where Σ_ε is the diagonal covariance of ε. The signal component is x_ijk = μ + u_i + v_j + ε_ijk; that is, the signal component depends on the speaker and the digit. The noise component ε_ijk is the deviation, or noise, of the i-th person saying the j-th digit in the k-th session, and μ is the overall mean of the training vectors.
  • u_i is the latent variable of speaker i, defined as a Gaussian with diagonal covariance Σ_u; p(u_i) is the probability that the speaker is i, and N(u_i | 0, Σ_u) is a Gaussian distribution with covariance Σ_u. v_j is the digital latent variable of digit j, defined as a Gaussian with diagonal covariance Σ_v; p(v_j) is the probability for digit j, and N(v_j | 0, Σ_v) is a Gaussian distribution with covariance Σ_v.
  • Formally, the Bayesian model can be described with conditional probabilities: N(x | μ, Σ) denotes a Gaussian in x with mean μ and covariance Σ.
  • When the terminal has calculated the probability that the speaker said each digit contained in the audio information, it calculates the likelihood ratio score based on the calculated probabilities. The likelihood ratio score is the ratio of the probability that speaker i said the j-th digit to the probability that it was not speaker i; for example, if the probability that speaker i said the j-th digit is 0.6, the probability that it was not speaker i is 0.4, and the likelihood ratio score is 0.6/0.4 = 1.5.
  • It should be understood that the terminal can calculate the probability that the speaker said each digit contained in the audio information for a preset number of iterations; the preset number of iterations can be 10, or can be set according to actual needs.
  • Specifically, the terminal can calculate the average of the likelihood ratio scores obtained from the multiple times the speaker said the j-th digit, and use that average as the likelihood ratio score for the speaker saying the j-th digit.
  • Alternatively, from the likelihood ratio scores obtained from the multiple times the speaker said the j-th digit, the terminal can first filter out the scores that are greater than or equal to a preset likelihood ratio score threshold, compute the average of the filtered scores, and use that average as the likelihood ratio score for the speaker saying the j-th digit.
  • The preset likelihood ratio score threshold can be 1.2, but is not limited to this value and can be set according to the actual situation; no limitation is imposed here.
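A sketch of this filtering-and-averaging step using the 1.2 threshold mentioned above. Falling back to the plain average when no repetition clears the threshold is an assumption, since the application does not specify that case; the likelihood ratio follows the 0.6/0.4 = 1.5 definition given above.

```python
import numpy as np

def lr_score(p_target: float) -> float:
    """Likelihood ratio: probability that speaker i said digit j divided by
    the probability that it was not speaker i (e.g. 0.6 / 0.4 = 1.5)."""
    return p_target / (1.0 - p_target)

def combined_lr(scores, threshold: float = 1.2) -> float:
    """Keep only repetitions whose LR score reaches the preset threshold,
    then average the kept scores."""
    kept = [s for s in scores if s >= threshold]
    return float(np.mean(kept if kept else scores))

reps = [lr_score(p) for p in (0.6, 0.55, 0.7)]  # three repetitions of digit j
print(combined_lr(reps))
```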
  • S204: Output the identity recognition result of the audio information based on the likelihood ratio score.
  • Specifically, when the likelihood ratio score for speaker i saying the j-th digit is greater than 1, the terminal determines that the speaker of the audio information is speaker i; when that likelihood ratio score is less than or equal to 1, it determines that the speaker of the audio information is not speaker i.
  • Further, the terminal converts the likelihood ratio scores of different speakers into a similar range through score normalization, so that a common likelihood ratio score threshold can be used for the decision.
  • Here, μ₁ and σ₁ are the approximate mean and standard deviation of the impostor score distribution, respectively.
  • For example, the following normalization methods can be used to normalize the likelihood ratio score:
  • Zero normalization (Z-Norm) uses a batch of non-target utterances against the target model to calculate the mean μ₁ and the standard deviation σ₁; that is, it is a regularization scheme that estimates the mean and variance of the impostor score distribution and applies a linear transformation to the score distribution.
  • Test normalization (T-Norm) uses the feature vectors of the unknown speaker to calculate statistics over a set of impostor speaker models; that is, it normalizes based on the mean and standard deviation of the impostor score distribution.
  • The difference from Z-Norm is that T-Norm uses a large number of impostor speaker models, rather than impostor speech data, to calculate the mean and standard deviation.
  • The normalization process is performed during recognition: a piece of test speech data is compared simultaneously with the claimed speaker model and a large number of impostor models to obtain impostor scores, from which the impostor score distribution and the normalization parameters μ₁ and σ₁ are calculated.
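A minimal sketch of zero normalization as described above; the impostor scores and the raw score are made-up values, and μ₁ and σ₁ are estimated directly from the impostor score batch.

```python
import numpy as np

def z_norm(raw_score: float, impostor_scores: np.ndarray) -> float:
    """Shift and scale a raw score by the mean (mu_1) and standard deviation
    (sigma_1) of the impostor score distribution, so one common threshold
    can be applied across speakers."""
    mu_1 = float(np.mean(impostor_scores))
    sigma_1 = float(np.std(impostor_scores))
    return (raw_score - mu_1) / sigma_1

impostor = np.array([0.4, 0.7, 0.5, 0.6, 0.45])  # non-target utterance scores
print(z_norm(1.5, impostor))
```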
  • Further, S204 may include: determining the identity recognition result of the audio information based on the likelihood ratio score; using a likelihood ratio verification method to check whether the identity recognition result is credible, and outputting the identity recognition result when it is verified as credible; and, when the identity recognition result is verified as not credible, returning to S201 or ending the process.
  • FIG. 3 is a schematic diagram of the null hypothesis and the alternative hypothesis provided by an embodiment of the present application.
  • Specifically, the verification is treated as a hypothesis testing problem with a null hypothesis H0, under which the audio vectors share the same speaker and digital latent variables u_i and v_j, and an alternative hypothesis H1. Both i and j are positive integers greater than or equal to 1.
  • Under the null hypothesis H0, one person corresponds to one digit; under the alternative hypothesis H1, one person corresponds to multiple digits, or one digit corresponds to multiple persons, or multiple persons and multiple digits are mixed.
  • U1, U2, V1, and V2 represent different speakers.
  • Xt indicates that a person said a digit; for example, person i said the digit j.
  • Xs indicates that a digit was said by a person; for example, the digit j was said by person i.
  • εt denotes the error of Xt, and εs denotes the error of Xs.
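Conceptually, the credibility check reduces to a likelihood ratio between the two hypotheses. A minimal sketch, where the log-likelihood values and the zero decision threshold are assumptions:

```python
def result_is_credible(log_p_h0: float, log_p_h1: float,
                       threshold: float = 0.0) -> bool:
    """Accept the identity recognition result when the log-likelihood ratio
    log p(X|H0) - log p(X|H1) clears the threshold: H0 says the audio shares
    one speaker latent u_i and one digit latent v_j, H1 says the latents are
    mixed across speakers or digits."""
    return (log_p_h0 - log_p_h1) >= threshold

print(result_is_credible(log_p_h0=-10.2, log_p_h1=-13.7))  # True: H0 wins
```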
  • When the verification result is credible, the identity recognition result of the audio information is determined to be accurate and credible, and the identity recognition result is output.
  • Further, after S204 the method may also include S205: when the identity recognition result is credible and the identity verification passes, respond to a voice control instruction from the speaker corresponding to the audio information, and execute the preset operation corresponding to the voice control instruction.
  • Specifically, the terminal judges, based on preset legal identity information, whether the speaker corresponding to the identity recognition result is a legitimate user, and determines that the verification passes when the speaker corresponding to the identity recognition result is a legitimate user. Afterwards, when a voice control instruction input by the speaker is acquired, the terminal responds to the voice control instruction, obtains the preset operation corresponding to the voice control instruction, and executes that preset operation.
  • For example, when the voice control instruction is a search instruction used to search for an item, the terminal responds to the search instruction by searching a local database or a network database for information related to the item corresponding to the search instruction.
  • In the embodiments of the present application, the loudspeaker latent variable and the digital latent variables of the to-be-recognized audio information are extracted; when the loudspeaker latent variable meets the preset requirements, the digital latent variables are input into the preset Bayesian model for processing to obtain the likelihood ratio score of the audio information, and the identity recognition result of the audio information is output based on the likelihood ratio score. Because the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable in the audio information meets the preset requirement, interference from the performance of the loudspeaker itself on digit pronunciation can be ruled out. The speaker's identity information is then recognized from the digital latent variable of each digit spoken by the test subject.
  • Since each digit can have several digital latent variables, the speaker's identity can be recognized accurately even if the speaker pronounces the same digit differently at different times. This avoids interference with the identity recognition result caused by different loudspeakers pronouncing the same digit differently, or by the speaker pronouncing the same digit differently at different times, and improves the accuracy of the identity recognition result. Outputting the identity recognition result based on the likelihood ratio score of the audio information reduces the probability of misjudgment and further improves the accuracy of the recognition result.
  • FIG. 4 is a schematic diagram of a terminal provided by an embodiment of the present application.
  • the units included in the terminal are used to execute the steps in the embodiments corresponding to FIGS. 1 to 2.
  • the terminal 4 includes:
  • the acquiring unit 410 is configured to acquire to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
  • the extraction unit 420 is configured to extract a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information;
  • the recognition unit 430 is configured to input the digital latent variables into a preset Bayesian model for voiceprint recognition when the loudspeaker latent variable meets a preset requirement, to obtain an identity recognition result, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
  • the extraction unit 420 includes:
  • the first detection unit is configured to extract the loudspeaker latent variable from the audio information and detect, based on the value of the loudspeaker latent variable, whether the audio information is qualified;
  • the second detection unit is configured to, when the audio information is qualified, detect whether the number string spoken by the test subject is the same as the reference number string, based on the reference number string and the audio corresponding to the number string spoken by the test subject;
  • the latent variable extraction unit is configured to extract the digital latent variables from the audio information when the detection result is that they are the same.
  • Further, the recognition unit 430 includes:
  • a calculation unit configured to input the digital latent variable into a preset Bayesian model for processing to obtain the likelihood ratio score of the audio information
  • the identity recognition unit is configured to output the identity recognition result of the audio information based on the likelihood ratio score.
  • The identity recognition unit is specifically configured to: determine the identity recognition result of the audio information based on the likelihood ratio score; check whether the identity recognition result is credible using a likelihood ratio verification method, and output the identity recognition result when it is verified as credible; and, when the identity recognition result is verified as not credible, end the process or have the acquiring unit 410 re-execute the acquisition of the to-be-identified audio information spoken by the test subject for the reference number string.
  • FIG. 5 is a schematic diagram of a terminal according to another embodiment of the present application.
  • the terminal 5 of this embodiment includes a processor 50, a memory 51, and computer-readable instructions 52 that are stored in the memory 51 and can run on the processor 50.
  • When the processor 50 executes the computer-readable instructions 52, the steps in the above embodiments of the method for identifying a speaker are implemented, for example, S101 to S103 shown in FIG. 1.
  • Alternatively, when the processor 50 executes the computer-readable instructions 52, the functions of the units in the foregoing device embodiments are implemented, for example, the functions of the units 410 to 430 shown in FIG. 4.
  • Exemplarily, the computer-readable instructions 52 may be divided into one or more units, and the one or more units are stored in the memory 51 and executed by the processor 50 to complete the present application.
  • The one or more units may be instruction segments of a series of computer-readable instructions capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 52 in the terminal 5.
  • For example, the computer-readable instructions 52 may be divided into an acquiring unit, an extraction unit, and a recognition unit, whose specific functions are as described above; for details, refer to the description of the embodiment corresponding to FIG. 4.
  • the terminal may include, but is not limited to, a processor 50 and a memory 51.
  • Those skilled in the art can understand that FIG. 5 is only an example of the terminal 5 and does not constitute a limitation on the terminal 5; the terminal may include more or fewer components than shown in the figure, combine certain components, or have different components.
  • For example, the terminal may also include input and output devices, network access devices, buses, and so on.
  • The so-called processor 50 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the memory 51 may be an internal storage unit of the terminal 5, such as a hard disk or a memory of the terminal 5.
  • The memory 51 may also be an external storage device of the terminal 5, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the terminal 5. Further, the memory 51 may also include both an internal storage unit of the terminal 5 and an external storage device.
  • the memory 51 is used to store the computer-readable instructions and other programs and data required by the terminal.
  • the memory 51 can also be used to temporarily store data that has been output or will be output.
  • By way of illustration and not limitation, non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Collating Specific Patterns (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A method and terminal for identifying a speaker. The method includes: acquiring to-be-identified audio information spoken by a test subject for a reference number string (S101), the audio information including a number string; extracting a loudspeaker latent variable and digital latent variables of the audio information (S102), the loudspeaker latent variable identifying characteristic information of the loudspeaker and the digital latent variables identifying the test subject's pronunciation characteristics of the digits in the audio information; and, when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result (S103). The method identifies the speaker's identity information based on the loudspeaker latent variable and the digital latent variables in the audio information, which avoids interference with the identity recognition result caused by different loudspeakers pronouncing the same digit differently or by the speaker pronouncing the same digit differently at different times, and improves the accuracy of the identity recognition result.

Description

Method and Terminal for Identifying a Speaker
This application claims priority to Chinese Patent Application No. 201910354414.7, entitled "Method and Terminal for Identifying a Speaker" and filed with the Chinese Patent Office on April 29, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of computer technology, and in particular relates to a method and terminal for identifying a speaker.
Background
With the rapid development of information technology and network technology, people's demand for identity recognition technology keeps growing. Identity recognition based on traditional password authentication has exposed many shortcomings in practical applications (for example, relatively low security and reliability), while identity recognition based on biometric features has matured in recent years and demonstrated its advantages in practice. Voiceprint recognition is one of the identity recognition technologies based on biometric features.
A voiceprint is an information map of a speaker's speech spectrum. Because every person's vocal organs differ, the sounds they produce and their pitches differ as well, so using the voiceprint as a basic feature for identity recognition is stable and hard to substitute. Voiceprint recognition comes in two types: text-dependent and text-independent.
In contrast to text-independent speaker recognition, in which the speech content is unconstrained, text-dependent speaker verification systems are better suited to security applications because they tend to show higher accuracy in short sessions.
Typical text-dependent speaker recognition has each user use a fixed phrase to match the enrollment and test phrases. In this case, an utterance from the user can be recorded in advance and then replayed. When the training and test utterances occur in different scenarios, sharing the same speech content can improve the security of the recognition protection to a certain extent: the system randomly presents number strings, and the user must correctly read out the corresponding content before the voiceprint is recognized. The introduction of this randomness means that every voiceprint collected in text-dependent recognition differs in its content sequence. However, when a randomly prompted number string is used to identify a speaker, the digit vocabulary of some samples is fixed and the digit samples are limited, and different loudspeakers pronounce the same digit differently, which can make it impossible to accurately identify the speaker's identity.
Technical Problem
The embodiments of the present application provide a method and terminal for identifying a speaker, so as to solve the problem in the prior art that, for complex voiceprint speech information (for example, short utterances or imitated speech), a text-independent voiceprint recognition system cannot accurately extract the speaker's voice features and therefore cannot accurately identify the speaker's identity.
Technical Solution
In view of this, the embodiments of the present application provide a method and terminal for identifying a speaker, so as to solve the problem in the prior art that, for complex voiceprint speech information (for example, short utterances or imitated speech), a text-independent voiceprint recognition system cannot accurately extract the speaker's voice features and therefore cannot accurately identify the speaker's identity.
A first aspect of the embodiments of the present application provides a method for identifying a speaker, including:
acquiring to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
extracting a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information; and
when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
A second aspect of the embodiments of the present application provides a terminal, including:
an acquiring unit, configured to acquire to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
an extraction unit, configured to extract a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information; and
a recognition unit, configured to input the digital latent variables into a preset Bayesian model for voiceprint recognition when the loudspeaker latent variable meets a preset requirement, to obtain an identity recognition result, where the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set, each digital latent variable carries an identity tag identifying the speaker to which it belongs, and the Bayesian model corresponds to the single speaker in the sound sample set.
A third aspect of the embodiments of the present application provides a terminal, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
acquiring to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
extracting a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information; and
when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
A fourth aspect of the embodiments of the present application provides a non-volatile computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the following steps:
acquiring to-be-identified audio information spoken by a test subject for a reference number string, where the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject;
extracting a loudspeaker latent variable and digital latent variables of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information; and
when the loudspeaker latent variable meets a preset requirement, inputting the digital latent variables into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in a sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
Beneficial Effects
In the embodiments of the present application, the loudspeaker latent variable and the digital latent variables of the to-be-recognized audio information are extracted; when the loudspeaker latent variable meets the requirement, the digital latent variables are input into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result. Because the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable in the audio information meets the preset requirement, interference with the identity recognition result caused by the performance of the loudspeaker itself on digit pronunciation can be ruled out. The speaker's identity information is then recognized from the digital latent variable of each digit spoken by the test subject. Since each digit can have several digital latent variables, the speaker's identity can be recognized accurately even if the speaker pronounces the same digit differently at different times. This avoids interference with the identity recognition result caused by different loudspeakers pronouncing the same digit differently, or by the speaker pronouncing the same digit differently at different times, and improves the accuracy of the identity recognition result.
Brief Description of the Drawings
FIG. 1 is a flowchart of an implementation of a method for identifying a speaker according to an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a method for identifying a speaker according to another embodiment of the present application;
FIG. 3 is a schematic diagram of the null hypothesis and the alternative hypothesis according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a terminal according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a terminal according to another embodiment of the present application.
Embodiments of the Invention
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present application.
Referring to FIG. 1, FIG. 1 is a flowchart of an implementation of a method for identifying a speaker according to an embodiment of the present application. The execution subject of the method for identifying a speaker in this embodiment is a terminal. Terminals include, but are not limited to, mobile terminals such as smartphones, tablet computers, and wearable devices, and may also be desktop computers and the like. The method for identifying a speaker shown in the figure may include the following steps:
S101: Acquire to-be-identified audio information spoken by the test subject for a reference number string; the reference number string is pre-stored and randomly played or displayed, and the audio information includes the audio corresponding to the number string spoken by the test subject.
When the terminal detects a speaker recognition instruction, it can acquire the to-be-recognized audio information uttered by a speaker in the surrounding environment through a built-in sound pickup device (for example, a microphone or loudspeaker); in this case, the audio information uttered by the speaker follows the reference number string randomly issued by the terminal. Alternatively, the terminal acquires the audio file or video file corresponding to the file identifier contained in the speaker recognition instruction, extracts the audio information from that audio or video file, and treats it as the audio information to be recognized. The audio or video file contains the audio information obtained when the test subject read out the reference number string; it may be uploaded by the user or downloaded from a server used to store audio and video files. The reference number string is pre-stored in the terminal and randomly played or displayed by the terminal. There may be multiple reference number strings.
The to-be-recognized audio information includes the audio corresponding to a number string, and the number string consists of at least one digit.
S102: Extract the loudspeaker latent variable and the digital latent variables of the audio information; the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digital latent variables are extracted when it is confirmed that the number string spoken by the test subject is the same as the reference number string, and the digital latent variables are used to identify the test subject's pronunciation characteristics of the digits in the audio information.
The terminal calculates the loudspeaker latent variable of the audio information based on the acquired audio information. The loudspeaker latent variable includes, but is not limited to, the signal-to-noise ratio, and can also include the loudspeaker's efficiency, sound pressure level, and so on.
The signal-to-noise ratio (SNR) is a parameter describing the ratio of the effective component to the noise component in a signal; the higher the loudspeaker's SNR, the clearer the sound it picks up. For example, the terminal extracts from the audio information the normal speech signal and the noise signal present when there is no speech, and computes the SNR of the audio information from these two signals.
The terminal can input the acquired audio information into a pre-trained deep neural network (DNN) model and extract the digital latent variable of each digit in the audio information through the deep neural network. A digital latent variable identifies a pronunciation characteristic of a digit. The same digit can have at least two different digital latent variables, that is, at least two different pronunciations; for example, the digit "1" can be pronounced "yī" or "yāo".
In this embodiment, the deep neural network model includes an input layer, a hidden layer, and an output layer. The input layer includes one input-layer node for receiving the input audio information from outside. The hidden layer includes two or more hidden-layer nodes, which process the audio information and extract its digital latent variables. The output layer outputs the processing result, that is, the digital latent variables of the audio information.
The deep neural network model is trained on a sound sample set, which includes the sound samples corresponding to each digit spoken by the speakers. There can be multiple speakers, for example 500 or 1,500; to a certain extent, the more samples used in training, the more accurate the results when the trained neural network model is used for recognition. The sound sample set includes a preset number of sound samples; each sound sample has a labeled digital latent variable, and each digital latent variable corresponds one-to-one with a sample label. A sound sample can contain only one digit, and each digit in the sound sample set can correspond to at least two digital latent variables. The digits covered by the sound samples include all digits that any randomly generated reference number string may contain; for example, when the randomly generated reference number string consists of any 6 of the 10 digits 0-9, the sound sample set includes sound samples for all 10 digits 0-9.
During training, the input data of the deep neural network model is the audio information, which can be a vector matrix derived from the audio information; the vector matrix is composed of the vectors corresponding to the digit-bearing audio data extracted from the audio information in sequence. The output data of the deep neural network model is the digital latent variable of each digit in the audio information.
It should be understood that one speaker corresponds to one neural network recognition model; when multiple speakers need to be recognized, the neural network recognition model corresponding to each speaker is trained.
When the terminal has extracted the loudspeaker latent variable, it judges, based on a preset condition, whether the extracted loudspeaker latent variable meets the preset requirement. The preset condition is used to determine whether the sound in the audio information is clear enough to be recognized, and can be set based on the value of the loudspeaker latent variable corresponding to a clearly recognizable sound. For example, when the loudspeaker latent variable is the signal-to-noise ratio, the preset condition can be that the SNR is greater than or equal to a preset SNR threshold, where the preset SNR threshold is the SNR corresponding to a clearly recognizable sound.
When the loudspeaker latent variable meets the requirement, S103 is executed; when it does not, the process may return to S101, or an identity recognition result indicating that the current speaker does not match the speaker corresponding to the reference number string may be output.
S103: When the loudspeaker latent variable meets the preset requirement, input the digital latent variables into the preset Bayesian model for voiceprint recognition to obtain an identity recognition result; the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by using a machine learning algorithm to train on the digital latent variables of each digit spoken by a single speaker in the sound sample set; each digital latent variable carries an identity tag identifying the speaker to which it belongs; and the Bayesian model corresponds to the single speaker in the sound sample set.
For example, when a speaker utters a digit with a given digital latent variable, the identity tag of that digital latent variable indicates the identity of the speaker who said the digit.
In this embodiment, the preset Bayesian model is expressed as follows:
p(x_ijk | u_i, v_j, θ) = N(x_ijk | μ + u_i + v_j, Σ_ε), where p(u_i) = N(u_i | 0, Σ_u) and p(v_j) = N(v_j | 0, Σ_v).
Here, p(x_ijk | u_i, v_j, θ) is the probability that a person said a digit, and x_ijk denotes the i-th person saying the j-th digit in the k-th session. Because a speaker may be asked to say several different number strings during identity verification or enrollment, k indexes the session, and θ is the collective name for the parameters of the Bayesian model. The conditional probability N(x_ijk | μ + u_i + v_j, Σ_ε) denotes a Gaussian distribution with mean μ + u_i + v_j and variance Σ_ε, where Σ_ε is the diagonal covariance of ε. The signal component is x_ijk = μ + u_i + v_j + ε_ijk; that is, the signal component x_ijk depends on the speaker and the digit. The noise component ε_ijk is the deviation, or noise, of the i-th person saying the j-th digit in the k-th session, and μ is the overall mean of the training vectors.
u_i is the latent variable of speaker i, defined as a Gaussian with diagonal covariance Σ_u; p(u_i) is the probability that the speaker is i, and N(u_i | 0, Σ_u) is a Gaussian distribution with covariance Σ_u. v_j is the digital latent variable of digit j, defined as a Gaussian with diagonal covariance Σ_v; p(v_j) is the probability for digit j, and N(v_j | 0, Σ_v) is a Gaussian distribution with covariance Σ_v.
Formally, the Bayesian model can be described with conditional probabilities: N(x | μ, Σ) denotes a Gaussian in x with mean μ and covariance Σ.
When the terminal completes the iterative calculation of the probability of the speaker having said each digit contained in the audio information, it identifies the speaker based on the probabilities obtained in each iteration to obtain the identity recognition result.
For example, with a total of 10 iterations: when the probability that speaker i said a digit is greater than or equal to a preset probability threshold (for example, 0.8), 1 point is scored; when it is less than the preset probability threshold (for example, 0.8), 0 points are scored. The total score of speaker i over the 10 iterations is counted, and when the total score is greater than or equal to a preset score threshold (for example, 7 points), the speaker of the audio information is determined to be speaker i.
In the embodiments of the present application, the loudspeaker latent variable and the digital latent variables of the to-be-recognized audio information are extracted; when the loudspeaker latent variable meets the requirement, the digital latent variables are input into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result. Because the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable in the audio information meets the preset requirement, interference with the identity recognition result caused by differences in how the loudspeaker itself renders digit pronunciations can be ruled out. The speaker's identity information is then recognized from the digital latent variable of each digit spoken by the test subject. Since each digit can have several digital latent variables, the speaker's identity can be recognized accurately even if the speaker pronounces the same digit differently at different times. This avoids interference with the identity recognition result caused by different loudspeakers pronouncing the same digit differently, or by the speaker pronouncing the same digit differently at different times, and improves the accuracy of the identity recognition result.
Referring to FIG. 2, FIG. 2 is a flowchart of an implementation of a method for identifying a speaker according to another embodiment of this application. The method for identifying a speaker in this embodiment is executed by a terminal. The terminal includes, but is not limited to, a mobile terminal such as a smartphone, a tablet computer, or a wearable device, and may also be a desktop computer or the like. The method for identifying a speaker of this embodiment includes the following steps:
S201: acquire audio information to be recognized, spoken by a subject in response to a reference digit string, where the reference digit string is stored in advance and played or displayed at random, and the audio information includes the audio corresponding to the digit string spoken by the subject.
S201 in this embodiment is identical to S101 in the previous embodiment; for details, refer to the description of S101 in the previous embodiment, which is not repeated here.
S202: extract the loudspeaker latent variable and the digit latent variable of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variable is extracted when it is confirmed that the digit string spoken by the subject is identical to the reference digit string, and the digit latent variable is used to identify the subject's pronunciation features for the digits in the audio information.
S202 in this embodiment is identical to S102 in the previous embodiment; for details, refer to the description of S102 in the previous embodiment, which is not repeated here.
Further, S202 may include S2021 to S2023, as follows:
S2021: extract the loudspeaker latent variable from the audio information, and detect whether the audio information is qualified based on the value of the loudspeaker latent variable.
Specifically, the terminal may extract the loudspeaker latent variable from the audio information to be recognized, and detect whether the audio information is qualified based on the extracted value of the loudspeaker latent variable and a preset loudspeaker latent variable threshold, so as to confirm whether the sound in the audio information is clear and recognizable. The loudspeaker latent variable is used to identify characteristic information of the loudspeaker; it includes, but is not limited to, the SNR, and may also include the loudspeaker's efficiency, sound pressure level, and the like. The preset loudspeaker latent variable threshold is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information.
The SNR is a parameter describing the proportional relationship between the useful component and the noise component in a signal. The higher the loudspeaker's SNR, the clearer the sound it picks up. For example, the terminal extracts the normal sound signal and the no-signal noise signal from the audio information, and computes the SNR of the audio information based on the normal sound signal and the noise signal.
For example, when the loudspeaker latent variable is the SNR and the terminal detects that the loudspeaker's SNR value is greater than or equal to 70, the audio information to be recognized is determined to be qualified, and the digits in the audio information can be clearly recognized.
When the detection result is that the audio information is unqualified, the speaker is prompted to read the random reference digit string again so that the audio information is reacquired; alternatively, the audio information to be recognized is reacquired from a database storing audio data. The reference digit string is a digit string randomly generated, or randomly obtained from a database, by the terminal in the course of identifying the speaker's identity and presented to the user; the terminal may play or display the reference digit string at random before S101, and when the reference digit string is played by voice broadcast, it is played with standard pronunciation. The reference digit string contains a preset number of digits, for example, 5 or 6 digits.
When the detection result is that the audio information is qualified, S2022 is executed.
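A minimal sketch of this retry-on-unqualified-audio branch, assuming a hypothetical record_fn that returns a recording together with its SNR (neither the function nor the attempt limit appears in the specification):

```python
import random

def acquire_qualified_audio(record_fn, max_attempts=3, snr_threshold=70.0):
    """Re-prompt until the recording passes the SNR gate of S2021;
    give up after max_attempts and let the caller report a mismatch."""
    for _ in range(max_attempts):
        audio, snr = record_fn()
        if snr >= snr_threshold:
            return audio
    return None

# Stand-in recorder: returns fake audio bytes and a random SNR value.
fake_recorder = lambda: ("<pcm bytes>", random.uniform(50.0, 90.0))
print(acquire_qualified_audio(fake_recorder) is not None)
```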
S2022: when the audio information is qualified, detect, based on the reference digit string and the audio corresponding to the digit string spoken by the subject, whether the digit string spoken by the subject is identical to the reference digit string.
When the detection result is that the audio information is qualified, the terminal may extract the digit-bearing speech segments from the audio information in sequence, recognize the digit string contained in the audio information using speech recognition technology, compare the reference digit string with the recognized digit string, and judge whether the digit string contained in the audio information is identical to the reference digit string.
Alternatively, the terminal may play the reference digit string to obtain the audio corresponding to the reference digit string, and compare that audio with the audio corresponding to the digit string spoken by the subject, so as to detect whether the digit string spoken by the subject is identical to the reference digit string. By comparing the audio obtained by playing the reference digit string with the collected audio of the digit string spoken by the subject, the terminal can reduce the deviation in digit pronunciation caused by the loudspeaker's own performance when picking up and playing audio.
If any digit in the recognized digit string differs from any digit in the reference digit string, or the digits are arranged in a different order, the digit string contained in the audio information is determined to be different from the reference digit string. In that case, the process may return to S201 to reacquire the speech data to be recognized, or the output identity recognition result may be that the current speaker does not match the speaker corresponding to the reference digit string.
When every digit in the reference digit string is identical to the corresponding digit in the recognized digit string, and the digits appear in the same order, the digit string contained in the audio information is determined to be identical to the reference digit string, and S2023 is executed.
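This comparison rule (every digit and its position must agree) can be illustrated as follows; the digit strings are placeholders for this example.

```python
def digit_strings_match(reference: str, recognized: str) -> bool:
    """Strings match only if every digit and its position agree; any
    differing digit or different ordering makes them different."""
    return list(reference) == list(recognized)

print(digit_strings_match("582031", "582031"))  # True
print(digit_strings_match("582031", "582013"))  # False: order differs
```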
S2023: when the detection result is that they are identical, extract the digit latent variable from the audio information.
When confirming that the digit string contained in the audio information is identical to the reference digit string, the terminal converts the audio information into a matrix of vectors, inputs it into the DNN model for processing, and extracts from the audio information the digit latent variable of each digit in the digit string. The digit latent variable is used to identify the subject's pronunciation features for the digits in the audio information.
Specifically, the terminal may input the acquired audio information into the pre-trained DNN model and extract, through the deep neural network, the digit latent variable of each digit in the audio information. A digit latent variable is used to identify the pronunciation features of one digit. The same digit may have at least two different digit latent variables, i.e., at least two different pronunciations. For example, in Chinese the digit "1" may be pronounced "yi" or "yao".
In this embodiment, the deep neural network model includes an input layer, hidden layers, and an output layer. The input layer includes one input-layer node for receiving the input audio information from the outside. The hidden layers include two or more hidden-layer nodes for processing the audio information and extracting its digit latent variables. The output layer outputs the processing result, i.e., the digit latent variables of the audio information.
The deep neural network model is trained on a voice sample set, which contains voice samples of each digit spoken by speakers. There may be a plurality of speakers, for example 500 or 1500; to a certain extent, the more training samples there are, the more accurate the result when the trained neural network model is used for recognition. The voice sample set contains a preset number of voice samples, each voice sample carries a labeled digit latent variable, and each digit latent variable corresponds one-to-one to a sample label. A voice sample may contain only one digit, and each digit in the voice sample set may correspond to at least two digit latent variables. The digits covered by the voice samples include all digits that any randomly generated reference digit string may contain. For example, if the randomly generated reference digit string consists of any 6 of the 10 digits 0-9, the voice sample set contains voice samples of all 10 digits 0-9.
During training, the input data of the deep neural network model is the audio information, which may be a vector matrix derived from the audio information; the vector matrix consists of the vectors corresponding to the digit-bearing audio data extracted from the audio information in sequence. The output data of the deep neural network model is the digit latent variable of each digit in the audio information.
It can be understood that one speaker corresponds to one neural network recognition model; when multiple speakers need to be recognized, a neural network recognition model is trained for each of the speakers.
S203: when the loudspeaker latent variable meets the preset requirement, input the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in the voice sample set; each digit latent variable carries an identity label identifying the speaker to whom the digit latent variable belongs; and the Bayesian model corresponds to that single speaker in the voice sample set.
The terminal inputs the digit latent variable of each digit contained in the audio information into the preset Bayesian model, and computes, using the formula $p(x_{ijk} \mid u_i, v_j, \theta) = \mathcal{N}(x_{ijk} \mid \mu + u_i + v_j, \Sigma_\varepsilon)$, the probability that the speaker spoke each digit contained in the audio information, with the priors $p(u_i) = \mathcal{N}(u_i \mid 0, \Sigma_u)$ and $p(v_j) = \mathcal{N}(v_j \mid 0, \Sigma_v)$.
Here $p(x_{ijk} \mid u_i, v_j, \theta)$ denotes the probability that a person spoke a digit, and $x_{ijk}$ denotes the $i$-th person speaking the $j$-th digit in the $k$-th session. Since a speaker may be asked to speak several different digit strings during identity verification or enrolment, $k$ indexes the session, and $\theta$ collectively denotes the parameters of the Bayesian model. The conditional probability $\mathcal{N}(x_{ijk} \mid \mu + u_i + v_j, \Sigma_\varepsilon)$ is a Gaussian distribution with mean $\mu + u_i + v_j$ and covariance $\Sigma_\varepsilon$, where $\Sigma_\varepsilon$ is the diagonal covariance of $\varepsilon$. The signal component is $x_{ijk} = \mu + u_i + v_j + \varepsilon_{ijk}$, i.e., the signal component $x_{ijk}$ depends on the speaker and the digit; the noise component $\varepsilon_{ijk}$ denotes the deviation or noise when the $i$-th person speaks the $j$-th digit in the $k$-th session, and $\mu$ is the overall mean of the training vectors.
$u_i$ denotes the latent variable of speaker $i$ and is defined as a Gaussian with diagonal covariance $\Sigma_u$; $p(u_i)$ denotes the probability that the speaker is $i$, and $\mathcal{N}(u_i \mid 0, \Sigma_u)$ is the Gaussian distribution with covariance $\Sigma_u$. $v_j$ denotes the latent variable of digit $j$ and is defined as a Gaussian with diagonal covariance $\Sigma_v$; $p(v_j)$ denotes the prior probability of digit $j$, and $\mathcal{N}(v_j \mid 0, \Sigma_v)$ is the Gaussian distribution with covariance $\Sigma_v$.
Formally, the Bayesian model can be described with conditional probabilities: $\mathcal{N}(x \mid \mu, \Sigma)$ denotes a Gaussian in $x$ with mean $\mu$ and covariance $\Sigma$.
When the terminal has computed the probability that the speaker spoke each digit contained in the audio information, it computes the likelihood ratio score based on the computed probabilities.
The likelihood ratio score is the ratio of the probability that speaker i spoke the j-th digit to the probability that it was not speaker i who spoke the j-th digit. For example, if the probability that speaker i spoke the j-th digit is 0.6, then the probability that it was not speaker i who spoke the j-th digit is 0.4, and the likelihood ratio score is 0.6/0.4 = 1.5.
It can be understood that the terminal may compute the probability that the speaker spoke each digit contained in the audio information for a preset number of iterations, where the preset number of iterations may be 10, or may be set as actually needed.
In that case, the terminal may compute, based on the likelihood ratio scores obtained each of the times the speaker spoke the j-th digit, the average likelihood ratio score for the j-th digit, and take this average likelihood ratio score as the likelihood ratio score that the speaker spoke the j-th digit.
To improve the accuracy of the identity recognition result, the terminal may also, based on the likelihood ratio scores obtained each of the times the speaker spoke the j-th digit, filter out the likelihood ratio scores greater than or equal to a preset likelihood ratio score threshold, compute the mean of the filtered scores, and take this mean as the likelihood ratio score that the speaker spoke the j-th digit. The preset likelihood ratio score threshold may be 1.2, but is not limited to this value and may be set according to the actual situation; no limitation is imposed here.
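The plain average and the threshold-filtered average described above might be sketched as follows; the score list is fabricated, and the 1.2 threshold mirrors the example value in the text.

```python
def averaged_lr(scores, lr_threshold=None):
    """Average the per-iteration likelihood ratio scores for digit j;
    when lr_threshold is given (e.g. 1.2), average only the scores
    that reach it, as in the accuracy-improving variant."""
    if lr_threshold is not None:
        scores = [s for s in scores if s >= lr_threshold]
    return sum(scores) / len(scores) if scores else 0.0

scores = [1.5, 0.9, 1.8, 1.3, 1.1]
print(averaged_lr(scores))        # plain mean: 1.32
print(averaged_lr(scores, 1.2))   # filtered mean of [1.5, 1.8, 1.3]
```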
S204: output the identity recognition result of the audio information based on the likelihood ratio score.
Since the likelihood ratio score is the ratio of the probability that speaker i spoke the j-th digit to the probability that it was not speaker i who spoke the j-th digit, the terminal may determine that the speaker of the audio information is speaker i when the likelihood ratio score that speaker i spoke the j-th digit is greater than 1, and that the speaker of the audio information is not speaker i when that likelihood ratio score is less than or equal to 1.
Further, in one implementation, to facilitate the use of a common likelihood ratio score threshold for identifying the speaker and thereby improve recognition efficiency, S204 may include: normalizing the likelihood ratio score using the formula $s' = (s - \mu_1)/\sigma_1$, and outputting the identity recognition result of the audio information based on the normalized likelihood ratio score, where $s'$ is the normalized likelihood ratio score, $s$ is the likelihood ratio score, $\mu_1$ is the approximate mean of the impostor score distribution, and $\sigma_1$ is the standard deviation of the impostor score distribution.
By using the above formula, the terminal converts the likelihood ratio scores from different speakers into a similar range, so that a common likelihood ratio score threshold can be used for the decision. Here $\mu_1$ and $\sigma_1$ are, respectively, the approximate mean and standard deviation of the impostor score distribution. The likelihood ratio score may be normalized by the following three normalization methods:
1) Zero normalization (Z-Norm) computes the mean $\mu_1$ and standard deviation $\sigma_1$ from a batch of non-target utterances scored against the target model; that is, it is a regularization scheme that estimates the mean and variance of the impostor score distribution and linearly transforms the score distribution. For a given speaker probability model, Z-Norm tests it with a large batch of impostor speech data to obtain the impostor score distribution. The mean $\mu_1$ and standard deviation $\sigma_1$ are then computed from the impostor score distribution and substituted into the formula $s' = (s - \mu_1)/\sigma_1$ to compute the normalized likelihood ratio score.
2) Test normalization (T-Norm) computes the statistics by scoring the unknown speaker's feature vector against a set of impostor speaker models; that is, the normalization is based on the mean and standard deviation of the impostor score distribution. Unlike Z-Norm, T-Norm uses a large number of impostor speaker models rather than impostor speech data to compute the mean and standard deviation. The normalization is performed at recognition time: one piece of test speech data is compared simultaneously with the claimed speaker model and a large number of impostor models to obtain the impostor scores, from which the impostor score distribution and the normalization parameters $\mu_1$ and $\sigma_1$ are computed.
3) The mean of the normalized values computed by Z-Norm and T-Norm is taken as the final normalized value, forming the normalized score $s'$ of the likelihood ratio score $s$. Z-Norm and T-Norm are existing techniques and are not described in further detail here.
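A minimal sketch of the $s' = (s - \mu_1)/\sigma_1$ normalization, assuming impostor score distributions are available; the random samples below stand in for real impostor trials only so that the sketch is executable.

```python
import numpy as np

def normalize_score(s, impostor_scores):
    """s' = (s - mu_1) / sigma_1, with mu_1 and sigma_1 estimated from
    an impostor score distribution. Z-Norm estimates them from impostor
    utterances scored against the target model; T-Norm from the test
    utterance scored against impostor models. The formula is the same."""
    mu_1 = float(np.mean(impostor_scores))
    sigma_1 = float(np.std(impostor_scores))
    return (s - mu_1) / sigma_1

# Placeholder impostor distributions (real systems would use actual trials).
z_scores = np.random.gamma(2.0, 0.3, 1000)  # impostor utterances vs. target model
t_scores = np.random.gamma(2.0, 0.3, 1000)  # test utterance vs. impostor models
s = 1.5
# Option 3 of the text: average the Z-Norm and T-Norm values.
s_final = (normalize_score(s, z_scores) + normalize_score(s, t_scores)) / 2
print(s_final)
```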
Further, in another implementation, to improve the credibility and accuracy of the recognition result and reduce the probability of misjudgment, S204 may include: determining the identity recognition result of the audio information based on the likelihood ratio score; checking whether the identity recognition result is trustworthy using a likelihood ratio verification method, and outputting the identity recognition result when it is verified to be trustworthy; when the identity recognition result is verified to be untrustworthy, returning to S201 or ending the procedure.
Referring also to FIG. 3, FIG. 3 is a schematic diagram of the null hypothesis and the alternative hypothesis according to an embodiment of this application.
Specifically, verification is treated as a hypothesis testing problem with a null hypothesis $H_0$, under which the audio vectors share the same speaker and digit latent variables $u_i$ and $v_j$, and an alternative hypothesis $H_1$. Both $i$ and $j$ are positive integers greater than or equal to 1.
Under the null hypothesis $H_0$, one person corresponds to one spoken digit; under the alternative hypothesis $H_1$, one person corresponds to several digits, or one digit corresponds to several people, or several people and several digits are mixed. $U_1$, $U_2$, $V_1$, $V_2$ denote different speakers.
The terminal can perform the verification by comparing the likelihood of the data under the different hypotheses shown in FIG. 3. $X_t$ represents that a person spoke a digit, for example, person $i$ spoke digit $j$; $X_s$ represents that the digit was spoken by a person, for example, digit $j$ was spoken by person $i$. $\varepsilon_t$ denotes the error of $X_t$, and $\varepsilon_s$ denotes the error of $X_s$.
Under the alternative hypothesis $H_1$, the features $X_t$ and $X_s$ do not match.
Under the hypothesis $H_0$, the two sides agree, and the features $X_t$ and $X_s$ match.
When the features $X_t$ and $X_s$ match, the identity recognition result of the audio information is determined to be accurate and trustworthy, and the identity recognition result is output.
When the features $X_t$ and $X_s$ do not match, the identity recognition result of the audio information is determined to be untrustworthy and a misjudgment may exist; in this case, the process returns to S201, or the procedure of identifying the speaker's identity ends.
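Under the stated model, one common way to realize this likelihood ratio verification is the two-covariance (PLDA-style) log-likelihood ratio sketched below; this is only a sketch under the assumption of diagonal covariances, not necessarily the exact computation used in the embodiment.

```python
import numpy as np

def log_gauss_1d(x, mean, var):
    """Per-dimension log density of a diagonal Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_gauss_2d_corr(a, b, mean, var, cov):
    """Per-dimension log density of (a, b) under a 2x2 Gaussian with
    equal marginal variances `var` and cross-covariance `cov`."""
    det = var ** 2 - cov ** 2
    quad = (var * (a - mean) ** 2 - 2 * cov * (a - mean) * (b - mean)
            + var * (b - mean) ** 2) / det
    return -0.5 * (2 * np.log(2 * np.pi) + np.log(det) + quad)

def verification_llr(x_t, x_s, mu, var_u, var_v, var_eps):
    """log p(x_t, x_s | H0) - log p(x_t | H1) - log p(x_s | H1) for the
    model x = mu + u + v + eps; a positive value favours H0 (match)."""
    var_tot = var_u + var_v + var_eps   # marginal variance per dimension
    var_shared = var_u + var_v          # covariance from shared latents
    ll_h0 = np.sum(log_gauss_2d_corr(x_t, x_s, mu, var_tot, var_shared))
    ll_h1 = np.sum(log_gauss_1d(x_t, mu, var_tot)
                   + log_gauss_1d(x_s, mu, var_tot))
    return ll_h0 - ll_h1

d = 16
vu, vv, ve = np.full(d, 1.0), np.full(d, 0.5), np.full(d, 0.3)
u, v = np.sqrt(vu) * np.random.randn(d), np.sqrt(vv) * np.random.randn(d)
x_t = u + v + np.sqrt(ve) * np.random.randn(d)   # test features
x_s = u + v + np.sqrt(ve) * np.random.randn(d)   # enrolment features
print(verification_llr(x_t, x_s, 0.0, vu, vv, ve) > 0)  # usually True
```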
Further, after S204, the method may also include S205: when the identity recognition result is trustworthy and the identity check is passed, responding to a voice control instruction from the speaker corresponding to the audio information, and executing the preset operation corresponding to the voice control instruction.
Upon obtaining a trustworthy identity recognition result, the terminal judges, based on preset legitimate identity information, whether the speaker corresponding to the identity recognition result is a legitimate user; when the speaker corresponding to the identity recognition result is a legitimate user, the check is determined to be passed. Thereafter, upon obtaining a voice control instruction input by the speaker, the terminal responds to the voice control instruction, obtains the preset operation corresponding to the voice control instruction, and executes that preset operation.
For example, when the voice control instruction is a search instruction for searching for a certain item, the terminal responds to the search instruction by searching a local database or a network database for information about the item targeted by the search instruction.
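Purely as an illustration, the gated command dispatch of S205 might look like the following; the command names and preset operations are hypothetical and do not appear in the specification.

```python
PRESET_OPERATIONS = {
    # Hypothetical binding of a voice command to its preset operation.
    "search": lambda query: f"searching local/network databases for {query!r}",
}

def handle_command(identity_trusted: bool, is_legitimate_user: bool,
                   command: str, payload: str) -> str:
    """Only a trusted, legitimate speaker may trigger the preset
    operation bound to the voice control instruction (S205)."""
    if not (identity_trusted and is_legitimate_user):
        return "rejected"
    op = PRESET_OPERATIONS.get(command)
    return op(payload) if op else "unknown command"

print(handle_command(True, True, "search", "umbrella"))
```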
In the embodiments of this application, the loudspeaker latent variable and the digit latent variable of the audio information to be recognized are extracted; when the loudspeaker latent variable meets the preset requirement, the digit latent variable is input into the preset Bayesian model for processing to obtain the likelihood ratio score of the audio information; the identity recognition result of the audio information is output based on the likelihood ratio score. Because the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information, when the loudspeaker latent variable of the audio information meets the preset requirement, interference with the identity recognition result caused by the loudspeaker's own performance affecting the pronunciation of the digits can be excluded. The speaker's identity is then recognized from the digit latent variable of each digit spoken by the subject; since each digit may have more than one digit latent variable, the speaker's identity can be recognized accurately even if the speaker pronounces the same digit differently at different times. This avoids the situation in which different loudspeakers render the same digit differently, or the speaker pronounces the same digit differently at different times, thereby interfering with the identity recognition result, and improves the accuracy of the identity recognition result. Outputting the identity recognition result based on the likelihood ratio score of the audio information can reduce the probability of misjudgment and further improve the accuracy of the recognition result.
It should be understood that the magnitudes of the step numbers in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
Referring to FIG. 4, FIG. 4 is a schematic diagram of a terminal according to an embodiment of this application. The units included in the terminal are used to execute the steps in the embodiments corresponding to FIG. 1 and FIG. 2; for details, refer to the relevant descriptions in the embodiments corresponding to FIG. 1 and FIG. 2. For ease of description, only the parts related to this embodiment are shown. Referring to FIG. 4, the terminal 4 includes:
an acquisition unit 410, configured to acquire audio information to be recognized, spoken by a subject in response to a reference digit string, where the reference digit string is stored in advance and played or displayed at random, and the audio information includes the audio corresponding to the digit string spoken by the subject;
an extraction unit 420, configured to extract the loudspeaker latent variable and the digit latent variable of the audio information, where the loudspeaker latent variable is used to identify characteristic information of the loudspeaker, the digit latent variable is extracted when it is confirmed that the digit string spoken by the subject is identical to the reference digit string, and the digit latent variable is used to identify the subject's pronunciation features for the digits in the audio information;
a recognition unit 430, configured to, when the loudspeaker latent variable meets the preset requirement, input the digit latent variable into the preset Bayesian model for voiceprint recognition to obtain an identity recognition result, where the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in the voice sample set; each digit latent variable carries an identity label identifying the speaker to whom the digit latent variable belongs; and the Bayesian model corresponds to that single speaker in the voice sample set.
Further, the extraction unit 420 includes:
a first detection unit, configured to extract the loudspeaker latent variable from the audio information and detect whether the audio information is qualified based on the value of the loudspeaker latent variable;
a second detection unit, configured to, when the audio information is qualified, detect, based on the reference digit string and the audio corresponding to the digit string spoken by the subject, whether the digit string spoken by the subject is identical to the reference digit string;
a latent variable extraction unit, configured to, when the detection result is that they are identical, extract the digit latent variable from the audio information.
Further, the recognition unit 430 includes:
a computation unit, configured to input the digit latent variable into the preset Bayesian model for processing to obtain the likelihood ratio score of the audio information;
an identity recognition unit, configured to output the identity recognition result of the audio information based on the likelihood ratio score.
Further, the identity recognition unit is specifically configured to: normalize the likelihood ratio score using the formula $s' = (s - \mu_1)/\sigma_1$, and output the identity recognition result of the audio information based on the normalized likelihood ratio score, where $s'$ is the normalized likelihood ratio score, $s$ is the likelihood ratio score, $\mu_1$ is the approximate mean of the impostor score distribution, and $\sigma_1$ is the standard deviation of the impostor score distribution.
Further, the identity recognition unit is specifically configured to: determine the identity recognition result of the audio information based on the likelihood ratio score; check whether the identity recognition result is trustworthy using a likelihood ratio verification method, and output the identity recognition result when it is verified to be trustworthy; when the identity recognition result is verified to be untrustworthy, end the procedure, or cause the acquisition unit 410 to execute the acquiring of the audio information to be recognized spoken by the subject in response to the reference digit string.
FIG. 5 is a schematic diagram of a terminal according to another embodiment of this application. As shown in FIG. 5, the terminal 5 of this embodiment includes a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and executable on the processor 50. When executing the computer-readable instructions 52, the processor 50 implements the steps in the above embodiments of the method for identifying a speaker of each terminal, for example, S101 to S103 shown in FIG. 1. Alternatively, when executing the computer-readable instructions 52, the processor 50 implements the functions of the units in the above apparatus embodiments, for example, the functions of the units 410 to 430 shown in FIG. 4.
Exemplarily, the computer-readable instructions 52 may be divided into one or more units, which are stored in the memory 51 and executed by the processor 50 to complete this application. The one or more units may be instruction segments of a series of computer-readable instructions capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 52 in the terminal 5. For example, the computer-readable instructions 52 may be divided into an acquisition unit, an extraction unit, and a recognition unit, whose specific functions are as described above; for details, refer to the description of the embodiment corresponding to FIG. 4.
The terminal may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art can understand that FIG. 5 is merely an example of the terminal 5 and does not constitute a limitation on the terminal 5, which may include more or fewer components than shown, or combine certain components, or have different components; for example, the terminal may also include input/output terminals, network access terminals, buses, and the like.
The so-called processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 51 may be an internal storage unit of the terminal 5, for example, a hard disk or internal memory of the terminal 5. The memory 51 may also be an external storage terminal of the terminal 5, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the terminal 5. Further, the memory 51 may include both an internal storage unit and an external storage terminal of the terminal 5. The memory 51 is used to store the computer-readable instructions and the other programs and data required by the terminal. The memory 51 may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through computer-readable instructions; the computer-readable instructions may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments are only used to illustrate the technical solutions of this application and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents; these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. A method for identifying a speaker, comprising:
    acquiring audio information to be recognized, spoken by a subject in response to a reference digit string, wherein the reference digit string is stored in advance and played or displayed at random, and the audio information comprises the audio corresponding to the digit string spoken by the subject;
    extracting a loudspeaker latent variable and a digit latent variable of the audio information, wherein the loudspeaker latent variable is used to identify characteristic information of a loudspeaker, the digit latent variable is extracted when it is confirmed that the digit string spoken by the subject is identical to the reference digit string, and the digit latent variable is used to identify the subject's pronunciation features for the digits in the audio information;
    when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a voice sample set, each digit latent variable carries an identity label identifying the speaker to whom the digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the voice sample set.
  2. The method according to claim 1, wherein the extracting the loudspeaker latent variable and the digit latent variable of the audio information comprises:
    extracting the loudspeaker latent variable from the audio information, and detecting whether the audio information is qualified based on the value of the loudspeaker latent variable;
    when the audio information is qualified, detecting, based on the reference digit string and the audio corresponding to the digit string spoken by the subject, whether the digit string spoken by the subject is identical to the reference digit string;
    when the detection result is that they are identical, extracting the digit latent variable from the audio information.
  3. The method according to claim 1 or 2, wherein the inputting, when the loudspeaker latent variable meets the preset requirement, the digit latent variable into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result comprises:
    inputting the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
    outputting the identity recognition result of the audio information based on the likelihood ratio score.
  4. The method according to claim 3, wherein the outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
    normalizing the likelihood ratio score using the formula $s' = (s - \mu_1)/\sigma_1$, and outputting the identity recognition result of the audio information based on the normalized likelihood ratio score, wherein $s'$ is the normalized likelihood ratio score, $s$ is the likelihood ratio score, $\mu_1$ is the approximate mean of the impostor score distribution, and $\sigma_1$ is the standard deviation of the impostor score distribution.
  5. The method according to claim 3, wherein the outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
    determining the identity recognition result of the audio information based on the likelihood ratio score;
    checking whether the identity recognition result is trustworthy using a likelihood ratio verification method, and outputting the identity recognition result when it is verified to be trustworthy; wherein, when the identity recognition result is verified to be untrustworthy, the method returns to the acquiring of the audio information to be recognized spoken by the subject in response to the reference digit string, or ends.
  6. A terminal, comprising:
    an acquisition unit, configured to acquire audio information to be recognized, spoken by a subject in response to a reference digit string, wherein the reference digit string is stored in advance and played or displayed at random, and the audio information comprises the audio corresponding to the digit string spoken by the subject;
    an extraction unit, configured to extract a loudspeaker latent variable and a digit latent variable of the audio information, wherein the loudspeaker latent variable is used to identify characteristic information of a loudspeaker, the digit latent variable is extracted when it is confirmed that the digit string spoken by the subject is identical to the reference digit string, and the digit latent variable is used to identify the subject's pronunciation features for the digits in the audio information;
    a recognition unit, configured to, when the loudspeaker latent variable meets a preset requirement, input the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, wherein the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a voice sample set; each digit latent variable carries an identity label identifying the speaker to whom the digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the voice sample set.
  7. The terminal according to claim 6, wherein the extraction unit comprises:
    a first detection unit, configured to extract the loudspeaker latent variable from the audio information and detect whether the audio information is qualified based on the value of the loudspeaker latent variable;
    a second detection unit, configured to, when the audio information is qualified, detect, based on the reference digit string and the audio corresponding to the digit string spoken by the subject, whether the digit string spoken by the subject is identical to the reference digit string;
    a latent variable extraction unit, configured to, when the detection result is that they are identical, extract the digit latent variable from the audio information.
  8. The terminal according to claim 6 or 7, wherein the recognition unit comprises:
    a computation unit, configured to input the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
    an identity recognition unit, configured to output the identity recognition result of the audio information based on the likelihood ratio score.
  9. The terminal according to claim 8, wherein the identity recognition unit is specifically configured to: normalize the likelihood ratio score using the formula $s' = (s - \mu_1)/\sigma_1$, and output the identity recognition result of the audio information based on the normalized likelihood ratio score, wherein $s'$ is the normalized likelihood ratio score, $s$ is the likelihood ratio score, $\mu_1$ is the approximate mean of the impostor score distribution, and $\sigma_1$ is the standard deviation of the impostor score distribution.
  10. The terminal according to claim 8, wherein the identity recognition unit is specifically configured to: determine the identity recognition result of the audio information based on the likelihood ratio score; check whether the identity recognition result is trustworthy using a likelihood ratio verification method, and output the identity recognition result when it is verified to be trustworthy; wherein, when the identity recognition result is verified to be untrustworthy, the procedure is ended, or the acquisition unit is notified to execute the acquiring of the audio information to be recognized spoken by the subject in response to the reference digit string.
  11. A terminal, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring audio information to be recognized, spoken by a subject in response to a reference digit string, wherein the reference digit string is stored in advance and played or displayed at random, and the audio information comprises the audio corresponding to the digit string spoken by the subject;
    extracting a loudspeaker latent variable and a digit latent variable of the audio information, wherein the loudspeaker latent variable is used to identify characteristic information of a loudspeaker, the digit latent variable is extracted when it is confirmed that the digit string spoken by the subject is identical to the reference digit string, and the digit latent variable is used to identify the subject's pronunciation features for the digits in the audio information;
    when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a voice sample set; each digit latent variable carries an identity label identifying the speaker to whom the digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the voice sample set.
  12. The terminal according to claim 11, wherein the extracting the loudspeaker latent variable and the digit latent variable of the audio information comprises:
    extracting the loudspeaker latent variable from the audio information, and detecting whether the audio information is qualified based on the value of the loudspeaker latent variable;
    when the audio information is qualified, detecting, based on the reference digit string and the audio corresponding to the digit string spoken by the subject, whether the digit string spoken by the subject is identical to the reference digit string;
    when the detection result is that they are identical, extracting the digit latent variable from the audio information.
  13. The terminal according to claim 11 or 12, wherein the inputting, when the loudspeaker latent variable meets the requirement, the digit latent variable into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result comprises:
    inputting the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
    outputting the identity recognition result of the audio information based on the likelihood ratio score.
  14. The terminal according to claim 13, wherein the outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
    normalizing the likelihood ratio score using the formula $s' = (s - \mu_1)/\sigma_1$, and outputting the identity recognition result of the audio information based on the normalized likelihood ratio score, wherein $s'$ is the normalized likelihood ratio score, $s$ is the likelihood ratio score, $\mu_1$ is the approximate mean of the impostor score distribution, and $\sigma_1$ is the standard deviation of the impostor score distribution.
  15. The terminal according to claim 13, wherein the outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
    determining the identity recognition result of the audio information based on the likelihood ratio score;
    checking whether the identity recognition result is trustworthy using a likelihood ratio verification method, and outputting the identity recognition result when it is verified to be trustworthy; wherein, when the identity recognition result is verified to be untrustworthy, the steps return to the acquiring of the audio information to be recognized spoken by the subject in response to the reference digit string, or end.
  16. A non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    acquiring audio information to be recognized, spoken by a subject in response to a reference digit string, wherein the reference digit string is stored in advance and played or displayed at random, and the audio information comprises the audio corresponding to the digit string spoken by the subject;
    extracting a loudspeaker latent variable and a digit latent variable of the audio information, wherein the loudspeaker latent variable is used to identify characteristic information of a loudspeaker, the digit latent variable is extracted when it is confirmed that the digit string spoken by the subject is identical to the reference digit string, and the digit latent variable is used to identify the subject's pronunciation features for the digits in the audio information;
    when the loudspeaker latent variable meets a preset requirement, inputting the digit latent variable into a preset Bayesian model for voiceprint recognition to obtain an identity recognition result, wherein the preset requirement is set based on the value of the loudspeaker latent variable corresponding to clearly recognizable audio information; the Bayesian model is obtained by training, with a machine learning algorithm, on the digit latent variables of each digit spoken by a single speaker in a voice sample set, each digit latent variable carries an identity label identifying the speaker to whom the digit latent variable belongs; and the Bayesian model corresponds to the single speaker in the voice sample set.
  17. The non-volatile computer-readable storage medium according to claim 16, wherein the extracting the loudspeaker latent variable and the digit latent variable of the audio information comprises:
    extracting the loudspeaker latent variable from the audio information, and detecting whether the audio information is qualified based on the value of the loudspeaker latent variable;
    when the audio information is qualified, detecting, based on the reference digit string and the audio corresponding to the digit string spoken by the subject, whether the digit string spoken by the subject is identical to the reference digit string;
    when the detection result is that they are identical, extracting the digit latent variable from the audio information.
  18. The non-volatile computer-readable storage medium according to claim 16 or 17, wherein the inputting, when the loudspeaker latent variable meets the preset requirement, the digit latent variable into the preset Bayesian model for voiceprint recognition to obtain the identity recognition result comprises:
    inputting the digit latent variable into the preset Bayesian model for processing to obtain a likelihood ratio score of the audio information;
    outputting the identity recognition result of the audio information based on the likelihood ratio score.
  19. The non-volatile computer-readable storage medium according to claim 18, wherein the outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
    normalizing the likelihood ratio score using the formula $s' = (s - \mu_1)/\sigma_1$, and outputting the identity recognition result of the audio information based on the normalized likelihood ratio score, wherein $s'$ is the normalized likelihood ratio score, $s$ is the likelihood ratio score, $\mu_1$ is the approximate mean of the impostor score distribution, and $\sigma_1$ is the standard deviation of the impostor score distribution.
  20. The non-volatile computer-readable storage medium according to claim 18, wherein the outputting the identity recognition result of the audio information based on the likelihood ratio score comprises:
    determining the identity recognition result of the audio information based on the likelihood ratio score;
    checking whether the identity recognition result is trustworthy using a likelihood ratio verification method, and outputting the identity recognition result when it is verified to be trustworthy; wherein, when the identity recognition result is verified to be untrustworthy, the steps return to the acquiring of the audio information to be recognized spoken by the subject in response to the reference digit string, or end.
PCT/CN2019/103299 2019-04-29 2019-08-29 一种识别说话人的方法及终端 WO2020220541A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910354414.7 2019-04-29
CN201910354414.7A CN110111798B (zh) 2019-04-29 2019-04-29 一种识别说话人的方法、终端及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2020220541A1 true WO2020220541A1 (zh) 2020-11-05

Family

ID=67487460

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103299 WO2020220541A1 (zh) 2019-04-29 2019-08-29 一种识别说话人的方法及终端

Country Status (2)

Country Link
CN (1) CN110111798B (zh)
WO (1) WO2020220541A1 (zh)

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN112820297A (zh) * 2020-12-30 2021-05-18 平安普惠企业管理有限公司 声纹识别方法、装置、计算机设备及存储介质

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN110111798B (zh) * 2019-04-29 2023-05-05 平安科技(深圳)有限公司 一种识别说话人的方法、终端及计算机可读存储介质
CN110503956B (zh) * 2019-09-17 2023-05-12 平安科技(深圳)有限公司 语音识别方法、装置、介质及电子设备
CN111768789B (zh) * 2020-08-03 2024-02-23 上海依图信息技术有限公司 电子设备及其语音发出者身份确定方法、装置和介质

Citations (7)

Publication number Priority date Publication date Assignee Title
JP5982297B2 (ja) * 2013-02-18 2016-08-31 日本電信電話株式会社 音声認識装置、音響モデル学習装置、その方法及びプログラム
WO2017168870A1 (ja) * 2016-03-28 2017-10-05 ソニー株式会社 情報処理装置及び情報処理方法
JP2018013722A (ja) * 2016-07-22 2018-01-25 国立研究開発法人情報通信研究機構 音響モデル最適化装置及びそのためのコンピュータプログラム
US9911413B1 (en) * 2016-12-28 2018-03-06 Amazon Technologies, Inc. Neural latent variable model for spoken language understanding
KR101843074B1 (ko) * 2016-10-07 2018-03-28 서울대학교산학협력단 Vae를 이용한 화자 인식 특징 추출 방법 및 시스템
CN109166586A (zh) * 2018-08-02 2019-01-08 平安科技(深圳)有限公司 一种识别说话人的方法及终端
CN110111798A (zh) * 2019-04-29 2019-08-09 平安科技(深圳)有限公司 一种识别说话人的方法及终端

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US9318112B2 (en) * 2014-02-14 2016-04-19 Google Inc. Recognizing speech in the presence of additional audio
CN106448685B (zh) * 2016-10-09 2019-11-22 北京远鉴科技有限公司 一种基于音素信息的声纹认证系统及方法
CN106531171B (zh) * 2016-10-13 2020-02-11 普强信息技术(北京)有限公司 一种动态声纹密码系统的实现方法
CN107104803B (zh) * 2017-03-31 2020-01-07 北京华控智加科技有限公司 一种基于数字口令与声纹联合确认的用户身份验证方法
CN109256138B (zh) * 2018-08-13 2023-07-07 平安科技(深圳)有限公司 身份验证方法、终端设备及计算机可读存储介质

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
JP5982297B2 (ja) * 2013-02-18 2016-08-31 日本電信電話株式会社 音声認識装置、音響モデル学習装置、その方法及びプログラム
WO2017168870A1 (ja) * 2016-03-28 2017-10-05 ソニー株式会社 情報処理装置及び情報処理方法
JP2018013722A (ja) * 2016-07-22 2018-01-25 国立研究開発法人情報通信研究機構 音響モデル最適化装置及びそのためのコンピュータプログラム
KR101843074B1 (ko) * 2016-10-07 2018-03-28 서울대학교산학협력단 Vae를 이용한 화자 인식 특징 추출 방법 및 시스템
US9911413B1 (en) * 2016-12-28 2018-03-06 Amazon Technologies, Inc. Neural latent variable model for spoken language understanding
CN109166586A (zh) * 2018-08-02 2019-01-08 平安科技(深圳)有限公司 一种识别说话人的方法及终端
CN110111798A (zh) * 2019-04-29 2019-08-09 平安科技(深圳)有限公司 一种识别说话人的方法及终端

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN112820297A (zh) * 2020-12-30 2021-05-18 平安普惠企业管理有限公司 声纹识别方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN110111798A (zh) 2019-08-09
CN110111798B (zh) 2023-05-05

Similar Documents

Publication Publication Date Title
Yu et al. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features
WO2020220541A1 (zh) 一种识别说话人的方法及终端
US11727942B2 (en) Age compensation in biometric systems using time-interval, gender and age
US11869513B2 (en) Authenticating a user
US7603275B2 (en) System, method and computer program product for verifying an identity using voiced to unvoiced classifiers
US20070038460A1 (en) Method and system to improve speaker verification accuracy by detecting repeat imposters
US20070219801A1 (en) System, method and computer program product for updating a biometric model based on changes in a biometric feature of a user
CN109243487B (zh) 一种归一化常q倒谱特征的回放语音检测方法
JPH11507443A (ja) 話者確認システム
Baloul et al. Challenge-based speaker recognition for mobile authentication
TW202213326A (zh) 用於說話者驗證的廣義化負對數似然損失
Orken et al. Development of security systems using DNN and i & x-vector classifiers
WO2020140609A1 (zh) 一种语音识别方法、设备及计算机可读存储介质
Georgescu et al. GMM-UBM modeling for speaker recognition on a Romanian large speech corpora
Saleema et al. Voice biometrics: the promising future of authentication in the internet of things
US11929077B2 (en) Multi-stage speaker enrollment in voice authentication and identification
JPWO2020003413A1 (ja) 情報処理装置、制御方法、及びプログラム
Varchol et al. Multimodal biometric authentication using speech and hand geometry fusion
CN113838469A (zh) 一种身份识别方法、系统及存储介质
Yang et al. User verification based on customized sentence reading
Chakraborty et al. An improved approach to open set text-independent speaker identification (OSTI-SI)
Mohamed et al. An Overview of the Development of Speaker Recognition Techniques for Various Applications.
Wan Speaker verification systems under various noise and SNR conditions
Toledano et al. BioSec Multimodal Biometric Database in Text-Dependent Speaker Recognition.
CN115641853A (zh) 声纹识别方法及装置、计算机可读存储介质、终端

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927222

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927222

Country of ref document: EP

Kind code of ref document: A1