CN111199729A

CN111199729A - Voiceprint recognition method and device

Info

Publication number: CN111199729A
Application number: CN201811378714.0A
Authority: CN
Inventors: 赵情恩; 索宏彬; 雷赟
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-11-19
Filing date: 2018-11-19
Publication date: 2020-05-26
Anticipated expiration: 2038-11-19
Also published as: CN111199729B

Abstract

The invention discloses a voiceprint recognition method, which comprises the following steps: counting the distribution condition of a plurality of similarity scores, wherein the similarity scores are used for representing the similarity between the voice signal to be recognized and the pre-stored voiceprint; and adjusting a classification threshold according to the distribution condition, wherein the classification threshold is used for classifying the similarity degree values so as to judge whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user. The invention also discloses a corresponding voiceprint recognition device.

Description

Voiceprint recognition method and device

Technical Field

The invention relates to the technical field of machine intelligence, in particular to a voiceprint recognition method and device.

Background

With the development of the internet of things and artificial intelligence technology, some intelligent voice devices, such as intelligent sound boxes and intelligent robots with voice interaction modules, appear on the market. In some usage scenarios, the smart voice device may confirm the user identity through a voiceprint recognition technique, and then provide personalized services to the user according to the user identity.

Voiceprint (Voiceprint) refers to a sound wave spectrum carrying speech information in human voice, has unique biological characteristics and has an identity recognition function. Voiceprint Identification (Voiceprint Identification), also called Speaker Identification (Speaker Identification), is a biometric Identification technology that extracts voice features from voice signals sent by a Speaker and verifies the identity of the Speaker accordingly. The voiceprint recognition process generally includes the steps that voiceprint information of a certain user or certain users is stored in advance (the user storing the voiceprint information is a registered user), voice features extracted from a voice signal of a speaker are compared with the voiceprint stored in advance to obtain a similarity score, then the similarity score is compared with a threshold value, and if the similarity score is larger than the threshold value, the speaker is considered to be the registered user corresponding to the voiceprint; and if the score is less than or equal to the threshold value, the speaker is not the registered user corresponding to the voiceprint.

In the prior art, voice equipment usually adopts a fixed and unchangeable threshold value to perform voiceprint recognition. However, the application scenes of the respective audio devices are different, and the application scenes of the same audio device may change with the passage of time. The fixed and unchangeable threshold is difficult to adapt to different and change of application scenes of the voice equipment, false recognition (namely, a speaker is not a registered user and is judged as a registered user) is easily caused when the threshold is set to be too small, missing recognition (namely, the speaker is originally the registered user and is judged as not the registered user) is easily caused when the threshold is set to be too large, the accuracy is difficult to ensure, and the user experience is not friendly enough.

Disclosure of Invention

To this end, the present invention provides a voiceprint recognition method and apparatus in an attempt to solve, or at least alleviate, the above-identified problems.

According to an aspect of the present invention, there is provided a voiceprint recognition method including: counting the distribution condition of a plurality of similarity values, wherein the similarity values are used for representing the similarity between the voice signal to be recognized and the pre-stored voiceprint; and adjusting a classification threshold according to the distribution condition, wherein the classification threshold is used for classifying the similarity degree values so as to judge whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user.

According to an aspect of the present invention, there is provided a voiceprint recognition apparatus comprising: the statistical module is suitable for counting the distribution condition of a plurality of similarity values, and the similarity values are used for representing the similarity between the voice signal to be recognized and the pre-stored voiceprint; and the threshold optimization module is suitable for adjusting a classification threshold according to the distribution condition, and the classification threshold is used for classifying the similarity degree values so as to judge whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user.

According to an aspect of the present invention, there is provided a smart speaker/tv adapted to perform the voiceprint recognition method as described above.

According to the technical scheme, the voice print recognition threshold value can be automatically adjusted according to the historical voice print recognition condition of the voice equipment, so that the accuracy of voice print recognition is improved, and the user experience is optimized.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.

FIG. 1 shows a schematic diagram of a voiceprint recognition system 100 according to one embodiment of the invention;

FIG. 2 shows a schematic diagram of a voiceprint recognition process according to one embodiment of the invention;

FIG. 3 illustrates a flow diagram of a voiceprint recognition method 300 according to one embodiment of the invention;

FIG. 4 shows a schematic diagram of adjusting classification thresholds according to five embodiments of the invention;

FIG. 5 illustrates a flow diagram of a voiceprint recognition method 500 according to one embodiment of the invention;

FIG. 6 shows a schematic diagram of a voiceprint recognition process according to another embodiment of the invention;

FIG. 7 shows a schematic diagram of a computing device 700, according to one embodiment of the invention;

FIG. 8 shows a schematic diagram of a voiceprint recognition apparatus 800 according to one embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 shows a schematic diagram of a voiceprint recognition system 100 according to one embodiment of the invention. As shown in fig. 1, system 100 includes a voice device 110 and a server 120. It should be noted that the system 100 shown in fig. 1 is only an example, and those skilled in the art will understand that in practical applications, the system 100 generally includes a plurality of voice devices 110 and servers 120, and the present invention does not limit the number of voice devices 110 and servers 120 included in the system 100.

Voice device 110 is a device having a voice interaction module that can receive voice instructions from a user and return voice or non-voice information to the user. A typical voice interaction module includes a voice input unit such as a microphone, a voice output unit such as a speaker, and a processor. The voice device 110 may be, for example, a smart speaker, a smart robot with a voice interaction module, a smart appliance (including a smart television, a smart refrigerator, etc.), and the like, but is not limited thereto. One application scenario of the voice device 110 is a home scenario, that is, the voice device 110 is placed in a home of a user, and the user can issue a voice instruction to the voice device 110 to perform some functions, such as accessing the internet, ordering songs, shopping, knowing weather forecast, controlling smart home devices in the home, and so on. The server 120 is used to provide acoustic computing services to the voice devices 110, and may be, for example, a cloud server physically located at one or more sites.

In one embodiment, system 100 also includes terminal device 130. The terminal device may be, for example, a mobile phone, a tablet computer, a smart wearable device, and the like, but is not limited thereto. Terminal device 130 is typically located in the same geographical location as voice device 110 and is used to assist voice device 110 in performing corresponding functions. In an embodiment, the voice device 110 is a smart speaker disposed in a home of the user, the terminal device 130 is a mobile phone of the user, and the mobile phone of the user may be paired with the smart speaker in the home to manage and set the smart speaker, assist the smart speaker to connect to a wireless network, and implement functions such as smart home control.

According to the embodiment of the invention, the voice device 110 receives the voice signal sent by the user, and the voice device 110 and the server 120 cooperate to recognize the identity of the user according to the voice signal, so as to provide personalized services, such as song recommendation, commodity recommendation and the like, for the user according to the identity of the user.

In one embodiment, a user may register a voiceprint on voice device 110, with the user registered with the voiceprint being a registered user. The process of registering the voiceprint by the user may be: the voice device 110 receives a voice signal for registration issued by a user and uploads the voice signal to the server 120. The server 120 builds a voiceprint model of the user according to the voice signal, and stores the identification of the voice device 110, the identification of the registered user and the voiceprint model of the registered user in a correlation mode. In some embodiments, when the user registers the voiceprint, the basic information of the user, such as sex, age, hobbies, and the like, is uploaded to the server 120 for subsequent use in providing personalized services to the user. Accordingly, server 120 may store the basic information uploaded by the user in association with the identity of voice device 110, the identity of the registered user, and the voiceprint model of the registered user.

After the user completes the voiceprint registration on the voice device 110, when someone (the user who has registered the voiceprint, the friends and friends of the user, or the like) uses the voice device 110, the voice device 110 receives a voice signal sent by the speaker, and sends the voice signal to the server 120. The server 120 obtains the voiceprint model of the registered user corresponding to the voice device 110, and compares the voice signal of the speaker with the voiceprint model of the registered user to obtain a similarity score of the voice signal of the speaker, where the similarity score is used to represent the similarity between the voice signal of the speaker and the registered voiceprint. Then, comparing the similarity score with a classification threshold, and if the similarity score is greater than the classification threshold, determining that the speaker is the registered user corresponding to the voiceprint; and if the similarity score is less than or equal to the classification threshold, the speaker is not the registered user corresponding to the voiceprint.

In the prior art, when a plurality of voice devices 110 exist in the system 100, each voice device usually adopts a fixed threshold value for voiceprint recognition. However, the application scenes of the respective audio devices are different, and the application scenes of the same audio device may change with the passage of time. The fixed and unchangeable threshold is difficult to adapt to different and change of application scenes of the voice equipment, false recognition (namely, a speaker is not a registered user and is judged as a registered user) is easily caused when the threshold is set to be too small, missing recognition (namely, the speaker is originally the registered user and is judged as not the registered user) is easily caused when the threshold is set to be too large, the accuracy is difficult to ensure, and the user experience is not friendly enough. In order to solve the problem, the system 100 of the present invention can implement a voiceprint recognition method, which automatically adjusts the threshold of voiceprint recognition according to the historical voiceprint recognition condition of the voice device, so as to improve the accuracy of voiceprint recognition and optimize the user experience.

FIG. 2 shows a schematic diagram of a voiceprint recognition process according to one embodiment of the invention. As shown in fig. 2, in the optimization process, the voice device 110 receives a voice signal to be recognized and transmits it to the server 120. The server 120 performs voiceprint recognition on the voice signals to be recognized, performs threshold optimization, and adjusts a classification threshold according to the similarity scores of the voice signals to be recognized.

As shown in fig. 2, a speech signal to be recognized is picked up by the speech device 110. In one embodiment, the voice device 110 is a smart speaker, and when a user wants to use the smart speaker, the user needs to speak a corresponding wake-up word to wake up the smart speaker, and the wake-up word sound signal sent by the user is the voice signal to be recognized. It should be noted that the wake-up word may be preset when the voice device 110 leaves the factory, or may be set by the user during the process of using the speaker. For example, when voice device 110 is a smart speaker named "tianmaoling," the wake up word may be set to "tianmaoling," "hello, tianmaoling," and so on.

In the above embodiment, the voice signal to be recognized is a wake-up word for the user to wake up the voice device 110. It will be understood by those skilled in the art that the speech signal to be recognized may be any speech signal uttered by the user and is not limited to the wake-up word. For example, the voice signal to be recognized may also be a voice indication made by the user, such as "please recommend a song to me", "please report a news item", etc.

As shown in fig. 2, after the voice device 110 collects the voice signal to be recognized, the voice signal is sent to the server 120 for voiceprint recognition. According to one embodiment, the process of voiceprint recognition includes steps Step 1-Step 3.

Step1, identifying a voice signal to be identified so as to determine a similarity score of the voice signal.

First, a speech feature of a speech signal to be recognized is extracted. The speech feature may be, for example, but not limited to, a mel-frequency cepstral coefficient, a linear prediction cepstral coefficient, i-vector, d-vector, x-vector, j-vector, etc. of the speech signal. One skilled in the art can select a certain parameter or a combination of several parameters as the speech feature according to actual needs, and the selection of the speech feature is not limited by the present invention.

And then, comparing the voice characteristics of the voice signal to be recognized with the pre-stored voiceprint to determine the similarity score of the voice signal.

The server 120 stores voiceprints of a plurality of registered users, and stores an association relationship among the voice device, the registered user, and the voiceprint of the registered user. After receiving the voice signal to be recognized sent by the voice device 110, the server 120 may obtain the registered user corresponding to the voice device 110 and the voiceprint of the registered user according to the association relationship. The voiceprint of the registered user corresponding to the voice device 110 is recorded as a pre-stored voiceprint.

When there is only one pre-stored voiceprint, the server 120 compares the voice feature of the voice signal to be recognized with the pre-stored voiceprint, and calculates a similarity score between the voice feature of the voice signal to be recognized and the pre-stored voiceprint. When there are a plurality of pre-stored voiceprints, the server 120 compares the voice feature of the voice signal to be recognized with each pre-stored voiceprint respectively, and calculates a similarity score between the voice feature and each pre-stored voiceprint respectively.

The similarity score can be calculated by using algorithms such as a Support Vector Machine (SVM), LDA (Linear Discriminant Analysis), PLDA (Probabilistic Linear Discriminant Analysis), likelihood, Cosine Distance (Cosine Distance), and the like, and the calculation method of the similarity score is not limited by the present invention.

And finally, storing the similarity score of the voice signal to be recognized and the identification of the voice equipment which collects the voice signal in a correlation manner.

The number of the similarity scores calculated through the steps is the same as the number of the pre-stored voiceprints. And when only one voiceprint is prestored, only one similarity score is calculated, and at the moment, the similarity score is stored in association with the identifier of the voice equipment which acquires the voice signal to be recognized. When a plurality of voiceprints are pre-stored, the number of the calculated similarity scores is also multiple, and at the moment, the maximum value of the similarity scores is stored in association with the identifier of the voice device for collecting the voice signal to be recognized. For example, the speech device 110 corresponds to three pre-stored voice prints vp1, vp2, vp3, the voice feature of the speech signal to be recognized is vf, the similarity scores of the voice feature vf and the pre-stored voice prints vp1, vp2, vp3 are calculated to be score1, score2, score3, respectively, and score1< score2< score3, then only score3 is stored in association with the identity of the speech device 110.

Thus, only one similarity score is ultimately stored for one speech signal to be recognized.

And step2, classifying the similarity score according to a classification threshold value so as to judge whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user.

When the similarity score is larger than the classification threshold value, judging that the voice signal to be recognized and the pre-stored voiceprint correspond to the same user; and when the similarity score is less than or equal to the classification threshold, judging that the voice signal to be recognized and the pre-stored voiceprint correspond to different users.

For example, the voice device 110 corresponds to a pre-stored voiceprint vp, the voiceprint vp corresponds to the registered user, the voice feature of the voice signal to be recognized is vf, the classification threshold is tv, and the similarity score between the voice feature vf and the pre-stored voiceprint vp is score. If score > tv, the voice signal to be recognized is considered to be sent out by the user of the registered user; and if score is less than or equal to tv, the voice signal to be recognized is not sent out by the user of the registered user.

For another example, the speech device 110 corresponds to three pre-stored voiceprints vp1, vp2 and vp3, the three voiceprints correspond to registered users user1, user2 and user3 respectively, the speech feature of the speech signal to be recognized is vf, the classification threshold is tv, and the similarity scores of the speech feature vf and the pre-stored voiceprints vp1, vp2 and vp3 are score1, score2 and score3 respectively. Respectively judging whether score1, score2 and score3 are greater than tv, and if score1> tv, determining that the voice signal to be recognized is sent out by a user1 of the registered user; if score2> tv, the voice signal to be recognized is considered to be sent out by the registered user 2; if score3> tv, the voice signal to be recognized is considered to be sent out by the registered user 3; if none of score1, score2 and score3 is greater than tv, the speech signal to be recognized is not considered to be emitted by the registered users user1, user2 and user 3.

Step3. the recognition result obtained at Step2 is returned to speech device 110.

It should be noted that Step3 is not essential, and in some embodiments, the recognition result may be returned to speech device 110. In other embodiments, the recognition result may not be returned to the voice device 110, but only stored in the server 120 for providing personalized services to the user. In some embodiments, the recognition result may not be returned to the speech device 110, but the personalized service provided according to the recognition result may be directly returned to the speech device, for example, the speech signal to be recognized collected by the speech device 110 is a speech instruction "please recommend me a song" issued by the user, the server 120 recognizes the speech signal, confirms the user identity, and further personalizes and recommends the song to the user according to the user identity, and the server 120 may return only the recommended song to the speech device 110 for playing, without returning a specific user identity recognition result to the speech device 110.

FIG. 3 shows a flow diagram of a voiceprint recognition method 300 according to one embodiment of the invention. The method 300 is performed in a server, corresponding to the threshold optimization portion of fig. 2. As shown in fig. 3, the method 300 begins at step S310.

In step S310, the distribution of the plurality of similarity scores is counted. As described above, the similarity score is used to represent the similarity between the speech signal to be recognized and the pre-stored voiceprint.

According to one embodiment, when the number of the voice signals to be recognized collected by the voice equipment in an accumulated mode is larger than a number threshold, the distribution situation of the similarity scores of the plurality of voice signals to be recognized collected by the voice equipment is counted. That is, when the number of voice signals to be recognized which are accumulatively collected by the voice equipment is greater than a number threshold, the classification threshold of the voice equipment is optimized; when the number of the voice signals to be recognized collected by the voice equipment in an accumulated mode is smaller than or equal to the number threshold, the classification threshold of the voice equipment is not optimized, and a preset default threshold is adopted all the time.

Each voice signal to be recognized is correspondingly stored with a similarity score. When the voice device is just put into use, the collected voice signals to be recognized are fewer, correspondingly, the similarity scores correspondingly stored by the voice device are also fewer, a small number of similarity scores (statistical samples) are subjected to statistical analysis, a credible conclusion formed by users of the voice device is difficult to obtain, and the updating effect of the classification threshold is poor. Therefore, when the voice signals to be recognized collected by the voice equipment are more, namely the stored similarity values are more (more than the quantity threshold value), the similarity values are subjected to statistical analysis, a more reliable statistical result can be obtained, and a more optimal classification threshold value is obtained. It should be noted that the specific value of the number threshold may be set by a person skilled in the art according to actual situations, the value of the number threshold is not limited by the present invention, and in one embodiment, the number threshold may be set to 200.

The distribution of the plurality of similarity values may be counted according to the following steps S312 and S314.

Step S312: a frequency distribution map or histogram of the plurality of similarity scores is determined.

The horizontal axis (x axis) of the frequency distribution graph is the similarity score, and the vertical axis (y axis) is the frequency. The frequency distribution map can be determined as follows: first, the maximum value and the minimum value of a plurality of similarity scores are determined, the difference between the maximum value and the minimum value is calculated, and the difference is the range. The similarity scores are divided into groups according to the range, and the group distance (i.e., the distance between the two endpoints of each group) of each group is the same. The number of similarity scores included in each group, i.e., the frequency of each group, is counted. And drawing a frequency distribution graph of the similarity score according to the group distance and the frequency.

The horizontal axis (x-axis) of the histogram is the similarity score and the vertical axis (y-axis) is the frequency. The frequency distribution map may be determined as follows: first, the maximum value and the minimum value of a plurality of similarity scores are determined, the difference between the maximum value and the minimum value is calculated, and the difference is the range. The similarity scores are divided into groups according to the range, and the group distance (i.e., the distance between the two endpoints of each group) of each group is the same. And counting the number of the similarity scores included in each group, namely the frequency of each group, and dividing the frequency of each group by the total number of the similarity scores to obtain the frequency of each group. And drawing a frequency distribution graph of the similarity score according to the group distance and the frequency.

Step S314: and performing gaussian curve fitting on the frequency distribution map or the frequency distribution map obtained in the step S312 to obtain at least one gaussian curve.

The gaussian curve is a normal distribution curve, and the function expression thereof is as follows:

where x is the similarity score, f (x) is the frequency or frequency, μ is the mean of the similarity scores, σ is the variance of the similarity scores, and exp () represents an exponential function with the natural constant e as the base. The frequency distribution map is the same as the number of gaussian curves to which the frequency distribution map is fitted and the mean value μ of each gaussian curve, and only the variance σ differs.

Subsequently, in step S320, the classification threshold is adjusted according to the distribution of the similarity scores. As described above, the classification threshold is used to classify the similarity score to determine whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user.

According to one embodiment, the classification threshold may be adjusted as follows:

and if the mean value of all the Gaussian curves is larger than the current classification threshold, updating the classification threshold into a numerical value with the absolute value of the difference between the minimum similarity score and the minimum similarity score on the Gaussian curve within a preset range. Typically, the gaussian curve with a mean greater than the current classification threshold is formed from the similarity scores of the registered users. In the step, the classification threshold value is updated to be close to the minimum similarity score on the Gaussian curve, so that the similarity score of the registered user is ensured to be higher than the classification threshold value with higher probability, and the voiceprint recognition authentication is realized.

In one embodiment, the classification threshold may be updated to the minimum similarity score on the gaussian curve. In another embodiment, the classification threshold may be updated to a value smaller than the minimum similarity score on the gaussian curve by a first preset value, or to a value larger than the minimum similarity score on the gaussian curve by a second preset value. In yet another embodiment, the classification threshold may be set to a value greater than the minimum similarity score on the gaussian curve, and the area of the gaussian curve to the right of the updated classification threshold may account for 90% of the total coverage area of the gaussian curve, which may ensure that 90% of the registered users are authenticated by voiceprint recognition.

And if the mean value of at least one Gaussian curve is less than or equal to the current classification threshold value, updating the classification threshold value to a numerical value between the minimum similarity score on the first Gaussian curve and the maximum similarity score on the second Gaussian curve, wherein the first Gaussian curve is a Gaussian curve with the mean value greater than the current classification threshold value, and the second Gaussian curve is a Gaussian curve with the mean value less than or equal to the current classification threshold value. In general, a gaussian curve with a mean value greater than the current classification threshold is formed by the similarity values of the registered users, and a gaussian curve with a mean value less than or equal to the current classification threshold is formed by the similarity values of the unregistered users. In the step, the classification threshold value is updated to be a numerical value between the minimum similarity score on the first Gaussian curve and the maximum similarity score on the second Gaussian curve, so that registered users and non-registered users can be better distinguished, and the false identification and the missing identification of the registered users are avoided.

In one embodiment, the classification threshold may be updated as the average of the minimum similarity score on the first Gaussian curve and the maximum similarity score on the second Gaussian curve. In another embodiment, the classification threshold may be updated as an average of the minimum mean of the first gaussian and the maximum mean of the second gaussian. Of course, the above are only two examples of updating the classification threshold, and the skilled person can set the classification threshold to any value between the minimum similarity score of the first gaussian and the maximum similarity score on the second gaussian according to actual needs.

Fig. 4 shows five embodiments of adjusting the classification threshold. In the frequency (frequency) distribution map a, the current classification threshold is x 0. Only one gaussian curve 410a is fitted in graph a and the mean of the gaussian curve is larger than the current classification threshold x 0. In this figure, the classification threshold may be updated to the minimum similarity score on the gaussian curve, i.e., the classification threshold is updated to a 1.

In the frequency (frequency) distribution graph b, the current classification threshold is x 0. Two

gaussian curves

410b, 420b are fitted to the graph b, and the mean of the two gaussian curves is larger than the current classification threshold x 0. In this figure, the classification threshold may be updated to a value slightly smaller than the minimum similarity score on the gaussian curve. The minimum similarity score on the gaussian is b2 and the classification threshold is updated to a value b1 slightly less than b 2.

In the frequency (frequency) distribution map c, the current classification threshold is x 0. Three

gaussian curves

410c, 420c, 430c are fitted in the graph c, the mean of the

gaussian curves

410c and 420c is smaller than the current classification threshold x0, and the mean of the gaussian curve 430c is larger than the current classification threshold x 0. In this figure, the classification threshold may be updated as the average of the maximum similarity score c2 on gaussian 420c and the minimum similarity score c4 on gaussian 430c, i.e. the classification threshold is updated as c3 being 0.5 (c2+ c 4).

In the frequency (frequency) distribution map d, the current classification threshold is x 0. Two

gaussian curves

410d, 420d are fitted to the graph d, the mean of the gaussian curve 410d is smaller than the current classification threshold x0, and the mean of the gaussian curve 420d is larger than the current classification threshold x 0. In this figure, the classification threshold may be updated as the average of the mean d1 of gaussian 410d and the mean d5 of gaussian 420d, i.e., the classification threshold is updated as d3 is 0.5 × (d1+ d 5).

In the frequency (frequency) distribution graph e, the current classification threshold is x 0. Three

gaussian curves

410e, 420e, 430e are fitted in the graph e, the mean values of the

gaussian curves

410e and 420e are both smaller than the current classification threshold x0, and the mean value of the gaussian curve 430e is larger than the current classification threshold x 0. In this figure, the classification threshold may be updated as the average of the maximum mean e1 of

gaussian curves

410e, 420e and the mean e5 of gaussian curve 430e, i.e., the classification threshold is updated as e3 ═ 0.5 × (e1+ e 5).

FIG. 5 shows a flow diagram of a voiceprint recognition method 500 according to one embodiment of the invention. The method 500 is performed in a server, and includes the voiceprint recognition and threshold optimization process illustrated in fig. 2, which is intended to illustrate the order and logical relationship of the execution of the voiceprint recognition and threshold optimization. As shown in fig. 5, the method 500 begins at step S510.

In step S510, a speech signal to be recognized is recognized to determine a similarity score of the speech signal.

In one embodiment, the speech signal to be recognized is captured by a speech device, which may be, for example, but not limited to, a smart speaker. The similarity score of the speech signal can be determined according to the Step1, which is not described herein.

When the number of the voice signals to be recognized collected cumulatively by the voice device is small (less than or equal to the number threshold), only step S540 is executed without executing steps S520 and S530, that is, only the voice print recognition is performed on the collected voice signals without optimizing the classification threshold.

When the number of the voice signals to be recognized collected by the voice device cumulatively is large (larger than the number threshold), the voiceprint recognition process of step S540 is executed in parallel with the threshold optimization process of steps S520 and S530.

In step S540, the similarity scores calculated in step S510 are classified according to a classification threshold, so as to determine whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user. It should be noted that the classification threshold used in step S540 is the current classification threshold, that is, the classification threshold that has not been updated yet. If the similarity score is larger than the classification threshold, judging that the voice signal to be recognized and the pre-stored voiceprint correspond to the same user; and if the similarity score is less than or equal to the classification threshold, judging that the voice signal to be recognized and the pre-stored voiceprint correspond to different users. For the specific implementation process of Step S540, reference may be made to the foregoing description of Step2, and details are not described herein again.

In step S520, the distribution of the plurality of similarity scores is counted. For the specific implementation process of step S520, reference may be made to the foregoing description of step S310, and details are not repeated here.

In step S530, the classification threshold is adjusted according to the distribution of the similarity scores. For the specific implementation process of step S530, reference may be made to the foregoing description of step S320, which is not described herein again. And the new classification threshold obtained after adjustment can be used for carrying out voiceprint recognition on the next voice signal to be recognized, which is acquired by the voice equipment.

As can be seen from the method 500, for a speech signal to be recognized newly collected by a speech device, voiceprint recognition needs to be performed on the speech signal according to the procedures shown in steps S510 and S540 to obtain a recognition result, that is, it is determined whether the speech signal and a pre-stored voiceprint correspond to the same user. In addition, aiming at the voice signal to be recognized newly collected by the voice equipment, the number of the voice signals to be recognized collected by the voice equipment is increased by one. If the number of the voice signals to be recognized collected by the voice equipment is less than or equal to the number threshold, the threshold optimization steps in the steps S520 and S530 are not executed; if the number of the speech signals to be recognized collected by the speech device is greater than the number threshold, steps S520 and S530 are executed to update and optimize the classification threshold to determine a new classification threshold. The new classification threshold may be used for voiceprint recognition of the next speech signal to be recognized that is collected by the speech device.

FIG. 6 shows a schematic diagram of a voiceprint recognition process according to another embodiment of the invention. The voiceprint recognition process shown in fig. 6 differs from that of fig. 2 only in that the voiceprint recognition process shown in fig. 6 is performed at speech device 110, while the voiceprint recognition process shown in fig. 2 is performed at server 120. For the specific implementation process of each step in fig. 6, reference may be made to the foregoing description related to fig. 2, and details are not repeated here.

At present, the voice device is limited by portability and aesthetic property, the hardware configuration is low, and the computing power is weak, so most of the computing process is executed by the server with stronger computing power, as shown in fig. 2. If the hardware configuration of the speech device is improved in the future and the computing power is improved, the computing process may be completed at the speech device, as shown in fig. 6.

FIG. 7 shows a schematic diagram of a computing device 700, according to one embodiment of the invention. As shown in fig. 7, in a basic configuration 702, a computing device 700 typically includes a system memory 706 and one or more processors 704. A memory bus 708 may be used for communicating between the processor 704 and the system memory 706.

Depending on the desired configuration, the processor 704 may be any type of processing, including but not limited to: a microprocessor (μ P), a microcontroller (μ C), a Digital Signal Processor (DSP), or any combination thereof. Processor 604 may include one or more levels of cache, such as a level one cache 710 and a level two cache 712, a processor core 714, and registers 716. Example processor core 714 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 718 may be used with the processor 704, or in some implementations the memory controller 718 may be an internal part of the processor 704.

Depending on the desired configuration, the system memory 706 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 706 may include an operating system 720, one or more applications 722, and program data 724. The application 722 is actually a plurality of program instructions that direct the processor 704 to perform corresponding operations. In some embodiments, the application 722 may be arranged to cause the processor 704 to operate with program data 724 on an operating system.

The computing device 700 may also include an interface bus 740 that facilitates communication from various interface devices (e.g., output devices 742, peripheral interfaces 744, and communication devices 746) to the basic configuration 702 via the bus/interface controller 730. The example output devices 742 include a graphics processing unit 748 and an audio processing unit 750. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 752. Example peripheral interfaces 744 can include a serial interface controller 754 and a parallel interface controller 756, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 758. An example communication device 746 may include a network controller 760, which may be arranged to facilitate communications with one or more other computing devices 762 over a network communication link via one or more communication ports 764.

A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

In a computing device 700 according to the invention, an application 722, for example, may comprise a voiceprint recognition apparatus 800, the apparatus 800 comprising a plurality of program instructions that may instruct the processor 704 to perform the voiceprint recognition method 300 of the invention. The computing device 700 may be embodied as, but is not limited to, a voice device (e.g., smart speaker, smart television, etc.) or a server.

FIG. 8 shows a schematic diagram of a voiceprint recognition apparatus 800 according to one embodiment of the invention. The apparatus 800 may reside in a voice device (e.g., the aforementioned voice device 110) or a server (e.g., the aforementioned server 120) for performing the voiceprint recognition method 300 of the present invention. As shown in fig. 8, the voiceprint recognition apparatus 800 includes a statistics module 810 and a threshold optimization module 820.

The statistic module 810 is adapted to count a distribution of a plurality of similarity scores, where the similarity scores are used to represent the similarity between the speech signal to be recognized and a pre-stored voiceprint. The statistic module 810 is specifically configured to execute the method of the step S310, and for processing logic and functions of the statistic module 810, reference may be made to the related description of the step S310, which is not described herein again.

The threshold optimization module 820 is adapted to adjust a classification threshold according to a distribution of the similarity values, where the classification threshold is used to classify the similarity values to determine whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user. The threshold optimization module 820 is specifically configured to execute the method of step S320, and for processing logic and functions of the threshold optimization module 820, reference may be made to the related description of step S320, which is not described herein again.

According to one embodiment, the apparatus 800 further comprises a voiceprint recognition module 830. The voiceprint recognition module 830 is adapted to recognize a speech signal to be recognized to determine a similarity score of the speech signal. The voiceprint recognition module 830 is specifically configured to execute the methods of steps 1-Step 3, and for the processing logic and functions of the voiceprint recognition module 830, reference may be made to the related descriptions of steps 1-Step 3, which are not described herein again.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, U.S. disks, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the data storage method and/or the data query method of the present invention according to instructions in the program code stored in the memory.

By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.

In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense with respect to the scope of the invention, as defined in the appended claims.

Claims

1. A voiceprint recognition method comprising:

counting the distribution condition of a plurality of similarity values, wherein the similarity values are used for representing the similarity between the voice signal to be recognized and the pre-stored voiceprint;

and adjusting a classification threshold according to the distribution condition, wherein the classification threshold is used for classifying the similarity degree values so as to judge whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user.

2. The method as claimed in claim 1, wherein the speech signal to be recognized is collected by a speech device, and the step of counting the distribution of the plurality of similarity scores comprises:

and when the number of the voice signals to be recognized collected by the voice equipment in an accumulated mode is larger than a number threshold, counting the distribution situation of the similarity values of the plurality of voice signals to be recognized collected by the voice equipment.

3. The method of claim 2, wherein the voice device comprises a smart speaker.

4. The method according to any one of claims 1 to 3, wherein the step of counting the distribution of the plurality of similarity scores comprises:

determining a frequency distribution map or a frequency distribution map of the plurality of similarity scores;

and performing Gaussian curve fitting on the frequency distribution diagram or the frequency distribution diagram to obtain at least one Gaussian curve.

5. The method of claim 4, wherein the step of adjusting the classification threshold according to the distribution comprises:

if the mean value of all Gaussian curves is larger than the current classification threshold, updating the classification threshold to a numerical value of which the absolute value of the difference with the minimum similarity score on the Gaussian curve is within a preset range;

and if the mean value of at least one Gaussian curve is less than or equal to the current classification threshold value, updating the classification threshold value to a numerical value between the minimum similarity score on a first Gaussian curve and the maximum similarity score on a second Gaussian curve, wherein the first Gaussian curve is a Gaussian curve with the mean value greater than the current classification threshold value, and the second Gaussian curve is a Gaussian curve with the mean value less than or equal to the current classification threshold value.

6. The method of claim 5, wherein the step of updating the classification threshold to a value within a preset range of an absolute value of a difference from a minimum similarity score on the Gaussian curve comprises:

updating the classification threshold to be the minimum similarity score on the Gaussian curve.

7. The method of claim 5, wherein the step of updating the classification threshold to a numerical value between a minimum similarity score on a first Gaussian curve and a maximum similarity score on a second Gaussian curve comprises:

updating the classification threshold to be an average of a minimum similarity score on a first Gaussian curve and a maximum similarity score on a second Gaussian curve; or

And updating the classification threshold value as the average value of the minimum mean value of the first Gaussian curve and the maximum mean value of the second Gaussian curve.

8. The method as claimed in claim 1, wherein the step of classifying the similarity score by the classification threshold to determine whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user comprises:

when the similarity score is larger than the classification threshold value, judging that the voice signal to be recognized and the pre-stored voiceprint correspond to the same user;

and when the similarity score is less than or equal to the classification threshold, judging that the voice signal to be recognized and the pre-stored voiceprint correspond to different users.

9. The method of claim 1, wherein the step of counting the distribution of the plurality of similarity scores further comprises:

and identifying the voice signal to be identified so as to determine the similarity score of the voice signal.

10. The method of claim 9, wherein the step of recognizing the speech signal to be recognized to determine the similarity score of the speech signal comprises:

extracting voice features of the voice signals;

comparing the voice features with pre-stored voiceprints to determine similarity scores of the voice signals;

and storing the similarity score of the voice signal in association with the identification of the voice equipment for collecting the voice signal.

11. The method of claim 9, wherein the step of recognizing the speech signal to be recognized to determine the similarity score of the speech signal comprises:

extracting voice features of the voice signals;

when a plurality of pre-stored voiceprints exist, the voice characteristics are respectively compared with each pre-stored voiceprint to obtain a plurality of similarity scores;

and storing the maximum value of the plurality of similarity scores in association with the identification of the voice equipment for collecting the voice signals.

12. A voiceprint recognition apparatus comprising:

the statistical module is suitable for counting the distribution condition of a plurality of similarity values, and the similarity values are used for representing the similarity between the voice signal to be recognized and the pre-stored voiceprint;

and the threshold optimization module is suitable for adjusting a classification threshold according to the distribution condition, and the classification threshold is used for classifying the similarity degree values so as to judge whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user.

13. A smart speaker/tv adapted to perform the voiceprint recognition method of any one of claims 1 to 11.