CN111199729B - Voiceprint recognition method and voiceprint recognition device - Google Patents


Info

Publication number: CN111199729B
Authority: CN (China)
Prior art keywords: voice, classification threshold, gaussian curve, value, similarity score
Legal status: Active (granted)
Application number: CN201811378714.0A
Other languages: Chinese (zh)
Other versions: CN111199729A
Inventors: 赵情恩 (Zhao Qing'en), 索宏彬 (Suo Hongbin), 雷赟 (Lei Yun)
Current Assignee: Alibaba Group Holding Ltd
Original Assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd; priority to CN201811378714.0A
Publication of application CN111199729A; application granted; publication of grant CN111199729B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/22: Interactive procedures; Man-machine interfaces


Abstract

The invention discloses a voiceprint recognition method comprising: collecting statistics on the distribution of a plurality of similarity scores, where each similarity score represents the similarity between a speech signal to be recognized and a pre-stored voiceprint; and adjusting a classification threshold according to that distribution, where the classification threshold is used to classify the similarity scores so as to determine whether the speech signal to be recognized and the pre-stored voiceprint correspond to the same user. The invention also discloses a corresponding voiceprint recognition apparatus.

Description

Voiceprint recognition method and voiceprint recognition device
Technical Field
The invention relates to the technical field of machine intelligence, in particular to a voiceprint recognition method and device.
Background
With the development of the Internet of Things and artificial intelligence technology, intelligent voice devices such as smart speakers and intelligent robots with voice interaction modules have appeared on the market. In some usage scenarios, an intelligent voice device can confirm a user's identity through voiceprint recognition and thereby provide personalized services tailored to that identity.
A voiceprint (Voiceprint) is the sound-wave spectrum carrying speech information in a human voice; it is a unique biological characteristic and can serve for identity recognition. Voiceprint recognition (Voiceprint Identification), also known as speaker recognition (Speaker Identification), is a biometric technique that extracts speech features from a speech signal uttered by a speaker and authenticates the speaker on that basis. The general voiceprint recognition process is as follows: voiceprint information of one or more users is stored in advance (a user whose voiceprint is stored is a registered user); the speech features extracted from a speaker's speech signal are compared with the pre-stored voiceprint to obtain a similarity score; the score is then compared with a threshold. If the score is greater than the threshold, the speaker is considered to be the registered user corresponding to the voiceprint; if the score is less than or equal to the threshold, the speaker is considered not to be that registered user.
In the prior art, voice devices typically employ a fixed threshold for voiceprint recognition. However, different voice devices operate in different application scenarios, and the scenario of a single device may change over time. A fixed threshold adapts poorly to such varied and changing scenarios: if the threshold is set too low, false acceptance easily occurs (a speaker who is not a registered user is judged to be one), and if it is set too high, false rejection easily occurs (a speaker who is a registered user is judged not to be one). Accuracy is therefore hard to guarantee, and the user experience suffers.
Disclosure of Invention
Accordingly, the present invention provides a voiceprint recognition method and apparatus that seek to solve, or at least mitigate, the problems identified above.
According to one aspect of the present invention, a voiceprint recognition method is provided, comprising: collecting statistics on the distribution of a plurality of similarity scores, where each similarity score represents the similarity between a speech signal to be recognized and a pre-stored voiceprint; and adjusting a classification threshold according to that distribution, where the classification threshold is used to classify the similarity scores so as to determine whether the speech signal to be recognized and the pre-stored voiceprint correspond to the same user.
According to another aspect of the present invention, a voiceprint recognition apparatus is provided, comprising: a statistics module adapted to collect statistics on the distribution of a plurality of similarity scores, each similarity score representing the similarity between a speech signal to be recognized and a pre-stored voiceprint; and a threshold optimization module adapted to adjust a classification threshold according to that distribution, the classification threshold being used to classify the similarity scores so as to determine whether the speech signal to be recognized and the pre-stored voiceprint correspond to the same user.
According to yet another aspect of the present invention, a smart speaker/television adapted to perform the voiceprint recognition method described above is provided.
According to the technical scheme of the present invention, the voiceprint recognition threshold can be adjusted automatically according to a voice device's historical recognition results, thereby improving voiceprint recognition accuracy and optimizing the user experience.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the invention may be understood more clearly and implemented in accordance with the content of this specification, and that the above and other objects, features, and advantages of the invention may be made more readily apparent, specific embodiments of the invention are described below.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which set forth the various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to fall within the scope of the claimed subject matter. The above, as well as additional objects, features, and advantages of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Like reference numerals generally refer to like parts or elements throughout the present disclosure.
FIG. 1 shows a schematic diagram of a voiceprint recognition system 100 in accordance with one embodiment of the present invention;
FIG. 2 shows a schematic diagram of a voiceprint recognition process in accordance with one embodiment of the present invention;
FIG. 3 illustrates a flow chart of a voiceprint recognition method 300 in accordance with one embodiment of the present invention;
FIG. 4 shows a schematic diagram of adjusting classification thresholds according to five embodiments of the invention;
FIG. 5 illustrates a flow chart of a voiceprint recognition method 500 in accordance with one embodiment of the present invention;
FIG. 6 shows a schematic diagram of a voiceprint recognition process in accordance with another embodiment of the present invention;
FIG. 7 shows a schematic diagram of a computing device 700 according to one embodiment of the invention;
Fig. 8 shows a schematic diagram of a voiceprint recognition device 800 in accordance with one embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a schematic diagram of a voiceprint recognition system 100 in accordance with one embodiment of the present invention. As shown in fig. 1, system 100 includes a voice device 110 and a server 120. It should be noted that the system 100 shown in fig. 1 is only an example, and those skilled in the art will understand that in practical applications, the system 100 generally includes a plurality of voice devices 110 and servers 120, and the present invention does not limit the number of voice devices 110 and servers 120 included in the system 100.
The voice device 110 is a device with a voice interaction module that can receive voice indications from a user and return voice or non-voice information to the user. A typical voice interaction module includes a voice input unit such as a microphone, a voice output unit such as a speaker, and a processor. The voice device 110 may be, for example, but not limited to, a smart speaker, a smart robot with a voice interaction module, a smart home appliance (including a smart television, a smart refrigerator, etc.), etc. One application scenario of the voice device 110 is a home scenario, i.e., the voice device 110 is placed in a user's home, the user may issue voice instructions to the voice device 110 to perform certain functions, such as surfing the internet, requesting songs, shopping, learning about weather forecast, controlling smart home devices in the home, etc. The server 120 is used to provide acoustic computing services to the voice device 110, which may be, for example, a cloud server physically located at one or more sites.
In one embodiment, the system 100 further comprises a terminal device 130. The terminal device may be, for example, a mobile phone, a tablet computer, an intelligent wearable device, etc., but is not limited thereto. Terminal device 130 is typically co-located with voice device 110 for assisting voice device 110 in achieving the corresponding functionality. In one embodiment, the voice device 110 is a smart speaker disposed in a user's home, the terminal device 130 is a user's mobile phone, and the user's mobile phone can be paired with the smart speaker in the home to manage and set the smart speaker, and assist the smart speaker in connecting to a wireless network, so as to realize functions such as smart home control.
According to the embodiment of the invention, the voice device 110 receives the voice signal sent by the user, and the voice device 110 is matched with the server 120, so that the identity of the user can be identified according to the voice signal, and personalized services such as song recommendation, commodity recommendation and the like can be provided for the user according to the identity of the user.
In one embodiment, a user may register a voiceprint on voice device 110, with the voiceprint registered user being the registered user. The process of user registration voiceprint may be: the voice device 110 receives a voice signal for registration from the user and uploads the voice signal to the server 120. The server 120 establishes a voiceprint model of the user according to the voice signal, and stores the identification of the voice device 110, the identification of the registered user, and the voiceprint model of the registered user in association. In some embodiments, when the user registers the voiceprint, the user's basic information, such as gender, age, interests, etc., is uploaded to the server 120 together for subsequent use in providing personalized services to the user. Accordingly, the server 120 stores the basic information uploaded by the user in association with the identity of the voice device 110, the registered user identity, and the voiceprint model of the registered user.
After the user completes voiceprint registration on the voice device 110, when someone (the user who has registered the voiceprint may be himself or herself or friends of the user, etc.) uses the voice device 110, the voice device 110 receives a voice signal sent by the speaker and sends the voice signal to the server 120. The server 120 obtains a voiceprint model of the registered user corresponding to the voice device 110, compares the voice signal of the speaker with the voiceprint model of the registered user, and obtains a similarity score of the voice signal of the speaker, where the similarity score is used to represent the similarity between the voice signal of the speaker and the registered voiceprint. Then, comparing the similarity score with a classification threshold, and if the similarity score is larger than the classification threshold, considering that the speaker is the registered user corresponding to the voiceprint; if the similarity score is less than or equal to the classification threshold, the speaker is not considered to be the registered user corresponding to the voiceprint.
In the prior art, when the system 100 contains multiple voice devices 110, each device typically employs a fixed threshold for voiceprint recognition. However, different voice devices operate in different application scenarios, and the scenario of a single device may change over time. A fixed threshold adapts poorly to such varied and changing scenarios: if the threshold is set too low, false acceptance easily occurs (a speaker who is not a registered user is judged to be one), and if it is set too high, false rejection easily occurs (a speaker who is a registered user is judged not to be one), so accuracy is hard to guarantee and the user experience suffers. To solve this problem, the system 100 of the present invention can perform a voiceprint recognition method that automatically adjusts the voiceprint recognition threshold according to a voice device's historical recognition results, thereby improving accuracy and optimizing the user experience.
FIG. 2 shows a schematic diagram of a voiceprint recognition process in accordance with one embodiment of the present invention. As shown in fig. 2, in the optimization process, the voice device 110 receives the voice signal to be recognized and transmits it to the server 120. The server 120 performs voiceprint recognition on the voice signals to be recognized, performs threshold optimization, and adjusts the classification threshold according to similarity scores of a plurality of voice signals to be recognized.
As shown in fig. 2, the speech signal to be recognized is collected by a voice device 110. In one embodiment, the voice device 110 is a smart speaker; when the user wants to use it, the user must speak a corresponding wake-up word to wake it, and the wake-up word speech signal uttered by the user is the speech signal to be recognized. It should be noted that the wake-up word may be preset when the voice device 110 leaves the factory, or set by the user while using the speaker; the invention does not limit the length or content of the wake-up word. For example, when the voice device 110 is a smart speaker named "Tmall Genie" (天猫精灵), the wake-up word may be set to "Tmall Genie", "Hello, Tmall", and so on.
In the above embodiment, the voice signal to be recognized is a wake word for the user to wake up the voice device 110. It will be appreciated by those skilled in the art that the speech signal to be recognized may be any speech signal issued by the user and is not limited to wake-up words. For example, the voice signal to be recognized may also be a voice indication sent by the user, for example, "please recommend a song to me", "please broadcast a news", and so on.
As shown in fig. 2, after the voice device 110 collects the voice signal to be recognized, the voice signal is sent to the server 120 for voiceprint recognition. According to one embodiment, the voiceprint recognition process includes steps Step 1-Step 3.
Step1. Score the speech signal to be recognized against the pre-stored voiceprints to determine its similarity score.
First, the speech features of the speech signal to be recognized are extracted. The speech features may be, for example, but not limited to, mel frequency cepstrum coefficients, linear prediction cepstrum coefficients, i-vector, d-vector, x-vector, j-vector, etc. of the speech signal. One skilled in the art can select a certain parameter or a combination of several parameters as the voice feature according to actual needs, and the invention does not limit the selection of the voice feature.
Subsequently, the speech characteristics of the speech signal to be identified are compared with pre-stored voiceprints to determine a similarity score for the speech signal.
The server 120 stores voiceprints of a plurality of registered users, together with the associations among voice devices, registered users, and registered users' voiceprints. After receiving the speech signal to be recognized from the voice device 110, the server 120 can look up, via these associations, the registered users of that device and their voiceprints. A voiceprint of a registered user corresponding to the voice device 110 is referred to as a pre-stored voiceprint.
When there is only one pre-stored voiceprint, the server 120 compares the voice characteristics of the voice signal to be recognized with the pre-stored voiceprint, and calculates a similarity score between the voice characteristics of the voice signal to be recognized and the pre-stored voiceprint. When there are a plurality of pre-stored voiceprints, the server 120 compares the voice features of the voice signal to be recognized with each pre-stored voiceprint, respectively, and calculates a similarity score between the voice features and each pre-stored voiceprint, respectively.
The similarity score may be calculated by using algorithms such as a Support Vector Machine (SVM), LDA (Linear Discriminant Analysis ), PLDA (Probabilistic Linear Discriminant Analysis, probabilistic linear discriminant analysis), likelihood and Cosine Distance (Cosine Distance), and the method for calculating the similarity score is not limited in the present invention.
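Cosine distance, one of the scoring options listed above, is simple to sketch. The following is a minimal illustration, not the patent's actual scoring implementation; the feature vectors and the function name are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: a feature of the signal to be recognized (vf)
# and a pre-stored voiceprint vector (vp).
vf = [0.9, 0.1, 0.4]
vp = [0.8, 0.2, 0.5]
score = cosine_similarity(vf, vp)
```

In practice the vectors would be i-vector/x-vector embeddings of much higher dimension, and PLDA scoring is often preferred; cosine similarity is shown here only because it is self-contained.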
And finally, storing the similarity score of the voice signal to be recognized in association with the identification of the voice equipment collecting the voice signal.
The number of similarity scores calculated in the steps above equals the number of pre-stored voiceprints. When only one voiceprint is pre-stored, only one similarity score is calculated, and it is stored in association with the identifier of the voice device that collected the speech signal to be recognized. When there are multiple pre-stored voiceprints, multiple similarity scores are calculated, and only the maximum of them is stored in association with that device identifier. For example, suppose the voice device 110 corresponds to three pre-stored voiceprints vp1, vp2, vp3, the speech feature of the signal to be recognized is vf, the similarity scores of vf against vp1, vp2, vp3 are score1, score2, score3, and score1 < score2 < score3; then only score3 is stored in association with the identifier of the voice device 110.
Thus, for a speech signal to be recognized, only one similarity score is ultimately stored.
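The max-score selection described above can be sketched in a few lines; the dictionary keys and example values are hypothetical, mirroring the vp1/vp2/vp3 example in the text.

```python
def score_to_store(scores_by_voiceprint):
    """For one utterance, keep only the highest similarity score across
    all pre-stored voiceprints; this single value is what gets stored
    with the device identifier."""
    return max(scores_by_voiceprint.values())

# Hypothetical scores against three pre-stored voiceprints,
# with score1 < score2 < score3 as in the example above.
scores = {"vp1": 0.42, "vp2": 0.55, "vp3": 0.81}
stored = score_to_store(scores)  # only score3 (0.81) is stored
```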
Step2, classifying the similarity scores according to the classification threshold to judge whether the voice signal to be recognized and the prestored voiceprint correspond to the same user.
When the similarity score is larger than the classification threshold, judging that the voice signal to be recognized and the prestored voiceprint correspond to the same user; and when the similarity score is smaller than or equal to the classification threshold value, judging that the voice signal to be recognized and the prestored voiceprint correspond to different users.
For example, the voice device 110 corresponds to a pre-stored voiceprint vp, the voiceprint vp corresponds to a user of the registered user, the voice feature of the voice signal to be recognized is vf, the classification threshold is tv, and the similarity score between the voice feature vf and the pre-stored voiceprint vp is score. If score > tv, consider the speech signal to be recognized to be sent by the registered user; if score is less than or equal to tv, the speech signal to be recognized is not considered to be sent by the user of the registered user.
For another example, the voice device 110 corresponds to three pre-stored voiceprints vp1, vp2, vp3, the three voiceprints correspond to the registered users user1, user2, user3, the voice feature of the voice signal to be recognized is vf, the classification threshold is tv, and the similarity scores of the voice feature vf and the pre-stored voiceprints vp1, vp2, vp3 are score1, score2, score3. Judging whether score1, score2 and score3 are larger than tv or not respectively, and if score1> tv, considering that the voice signal to be recognized is sent out by a user 1; if score2> tv, then consider the speech signal to be recognized to be sent by registered user 2; if score3> tv, then consider the speech signal to be recognized to be sent by the registered user 3; if none of score1, score2, and score3 is greater than tv, then the speech signal to be recognized is not considered to be emitted by the registered users user1, user2, and user 3.
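The Step2 decision rule in the two examples above can be sketched as follows; the user names, scores, and threshold value are hypothetical.

```python
def classify(scores_by_user, threshold):
    """Return the registered users whose similarity score exceeds the
    classification threshold. An empty result means the utterance is
    attributed to none of the registered users."""
    return [user for user, score in scores_by_user.items() if score > threshold]

tv = 0.6  # hypothetical classification threshold
matches = classify({"user1": 0.35, "user2": 0.72, "user3": 0.58}, tv)
# matches contains only user2, whose score exceeds tv
```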
Step3. The recognition result obtained in Step2 is returned to the speech device 110.
It should be noted that Step3 is optional. In some embodiments, the recognition result is returned to the voice device 110. In other embodiments, the recognition result is not returned but is stored only in the server 120 for later use in providing personalized services. In still other embodiments, instead of the recognition result itself, the personalized service derived from it is returned directly to the device. For example, if the speech signal collected by the voice device 110 is the user instruction "please recommend a song to me", the server 120 recognizes the signal, confirms the user's identity, and recommends a song accordingly; the server 120 may then return only the recommended song to the voice device 110 for playback, rather than the specific identity recognition result.
FIG. 3 illustrates a flow chart of a voiceprint recognition method 300 in accordance with one embodiment of the present invention. The method 300 is performed in a server, corresponding to the threshold optimization portion of fig. 2. As shown in fig. 3, the method 300 begins at step S310.
In step S310, distribution of a plurality of similarity scores is counted. As previously described, the similarity score is used to represent the similarity of the speech signal to be identified to the pre-stored voiceprints.
According to one embodiment, when the number of speech signals to be recognized that a voice device has cumulatively collected exceeds a number threshold, the distribution of the similarity scores of those signals is counted. That is, the classification threshold of a voice device is optimized only once the device has cumulatively collected more speech signals to be recognized than the number threshold; until then, the classification threshold is not optimized and a preset default threshold is used.
Each speech signal to be recognized has one stored similarity score. When a voice device has just been put into use, it has collected few speech signals, so few similarity scores are stored for it; statistical analysis over such a small sample can hardly yield reliable conclusions about the device's users, and updating the classification threshold from it works poorly. Once the device has collected more speech signals, i.e. more similarity scores are stored (more than the number threshold), statistical analysis of the scores produces a more reliable result and hence a better classification threshold. The specific value of the number threshold can be set by those skilled in the art according to the actual situation, and the invention does not limit it; in one embodiment, the number threshold may be set to 200.
The distribution of the plurality of similarity scores may be counted as follows steps S312, S314.
Step S312: determine a frequency-count distribution chart or a relative-frequency distribution chart of the plurality of similarity scores.
In the frequency-count distribution chart, the horizontal axis (x-axis) is the similarity score and the vertical axis (y-axis) is the frequency count. The chart may be determined as follows: first, find the maximum and minimum of the similarity scores and compute their difference, i.e. the range. Divide the scores into groups over this range, each group having the same group distance (the distance between a group's two endpoints). Count the number of similarity scores falling in each group, i.e. each group's frequency count, and plot the chart from the group distances and frequency counts.
In the relative-frequency distribution chart, the horizontal axis (x-axis) is the similarity score and the vertical axis (y-axis) is the relative frequency. The chart is determined in the same way, except that each group's frequency count is additionally divided by the total number of similarity scores to obtain the group's relative frequency, and the chart is plotted from the group distances and relative frequencies.
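The grouping procedure above (range, equal group distances, counts, and counts divided by the total) can be sketched as follows; the score list and group count are hypothetical.

```python
def frequency_distribution(scores, num_groups):
    """Divide similarity scores into equal-width groups over their range,
    returning each group's frequency count and relative frequency."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_groups          # the group distance
    counts = [0] * num_groups
    for s in scores:
        # Clamp so the maximum score falls in the last group.
        idx = min(int((s - lo) / width), num_groups - 1)
        counts[idx] += 1
    total = len(scores)
    rel = [c / total for c in counts]       # frequency count / total
    return counts, rel

# Hypothetical similarity scores grouped into 5 groups.
scores = [0.2, 0.25, 0.3, 0.55, 0.6, 0.62, 0.8, 0.85, 0.9, 0.95]
counts, rel = frequency_distribution(scores, 5)
```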
Step S314: perform Gaussian curve fitting on the frequency-count chart or relative-frequency chart obtained in step S312 to obtain at least one Gaussian curve.
A Gaussian curve is a normal-distribution curve with the functional expression:

f(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))

where x is the similarity score, f(x) is the frequency count or relative frequency, μ is the mean of the similarity scores, σ is the standard deviation of the similarity scores, and exp() denotes the exponential function with the natural constant e as its base. Fitting the frequency-count chart and fitting the relative-frequency chart yield the same number of Gaussian curves with the same means μ; only the σ values differ.
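A minimal sketch of the normal-distribution curve and a crude single-curve fit: estimating μ and σ directly as the sample mean and standard deviation of the scores. This is an illustration, not the patent's fitting algorithm (which fits the plotted chart and may yield several curves); the score list is hypothetical.

```python
import math
import statistics

def gaussian(x, mu, sigma):
    """Normal-distribution curve f(x) with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical similarity scores; estimate the curve's parameters from them.
scores = [0.70, 0.74, 0.78, 0.80, 0.82, 0.86, 0.90]
mu = statistics.mean(scores)
sigma = statistics.stdev(scores)
peak = gaussian(mu, mu, sigma)  # the curve attains its maximum at x = mu
```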
Subsequently, in step S320, the classification threshold is adjusted according to the distribution of the similarity scores. As described above, the classification threshold is used to classify the similarity score to determine whether the speech signal to be recognized corresponds to the same user as the pre-stored voiceprint.
According to one embodiment, the classification threshold may be adjusted as follows:
If the means of all the Gaussian curves are greater than the current classification threshold, the classification threshold is updated to a value whose absolute difference from the minimum similarity score on the Gaussian curve lies within a preset range. Typically, a Gaussian curve whose mean is greater than the current classification threshold is formed from the similarity scores of registered users. In this step, the classification threshold is updated to a value near the minimum similarity score on the curve, which ensures that a registered user's similarity score exceeds the classification threshold with high probability, so that the user passes voiceprint authentication.
In one embodiment, the classification threshold may be updated to the minimum similarity score on the gaussian curve. In another embodiment, the classification threshold may be updated to a value that is less than the minimum similarity score on the gaussian by a first preset value, or to a value that is greater than the minimum similarity score on the gaussian by a second preset value. In yet another embodiment, the classification threshold may be set to a value greater than the minimum similarity score on the gaussian, and the area of the gaussian to the right of the updated classification threshold is made to account for 90% of the total coverage area of the gaussian, thus ensuring that registered users have a 90% probability of passing voiceprint recognition authentication.
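The 90%-area embodiment above has a closed form: the threshold is the point where the right-tail area of the fitted Gaussian equals 90%, i.e. the 10th percentile of the fitted distribution. A sketch using the standard library's `statistics.NormalDist` (the μ, σ, and pass rate values are hypothetical):

```python
from statistics import NormalDist

def threshold_for_pass_rate(mu, sigma, pass_rate=0.9):
    """Set the classification threshold so that the area of the fitted
    Gaussian to the right of it equals pass_rate, giving registered
    users a pass_rate chance of passing voiceprint authentication."""
    # Right-tail area = pass_rate  <=>  threshold is the (1 - pass_rate) quantile.
    return NormalDist(mu, sigma).inv_cdf(1.0 - pass_rate)

tv = threshold_for_pass_rate(mu=0.8, sigma=0.05, pass_rate=0.9)
# tv lies below the mean 0.8, leaving 90% of the curve's area to its right
```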
If at least one of the Gaussian curves has a mean smaller than or equal to the current classification threshold, the classification threshold is updated to a value between the minimum similarity score on the first Gaussian curve and the maximum similarity score on the second Gaussian curve, where the first Gaussian curve is a Gaussian curve whose mean is larger than the current classification threshold, and the second Gaussian curve is a Gaussian curve whose mean is smaller than or equal to the current classification threshold. Typically, a Gaussian curve whose mean is greater than the current classification threshold is formed from the similarity scores of registered users, and a Gaussian curve whose mean is smaller than or equal to the current classification threshold is formed from the similarity scores of non-registered users. Updating the classification threshold to a value between the minimum similarity score on the first Gaussian curve and the maximum similarity score on the second Gaussian curve better distinguishes registered users from non-registered users and avoids misrecognition of registered users.
In one embodiment, the classification threshold may be updated to the average of the minimum similarity score on the first Gaussian curve and the maximum similarity score on the second Gaussian curve. In another embodiment, the classification threshold may be updated to the average of the minimum mean of the first Gaussian curves and the maximum mean of the second Gaussian curves. These are only two examples of updating the classification threshold; a person skilled in the art may set the classification threshold to any value between the minimum similarity score on the first Gaussian curve and the maximum similarity score on the second Gaussian curve according to actual needs.
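Both update rules can be sketched together. Here each fitted curve is summarized by its mean and the minimum and maximum similarity scores observed on it; this representation and the midpoint choice for the mixed case are assumptions of the sketch, not requirements of the method:

```python
def update_threshold(current, curves):
    """curves: list of (mean, score_min, score_max), one tuple per fitted
    Gaussian curve. Implements the two update rules described above."""
    above = [c for c in curves if c[0] > current]   # candidate registered-user curves
    below = [c for c in curves if c[0] <= current]  # candidate non-registered curves
    if not above:
        return current  # no curve to anchor a new threshold to; keep the old one
    if not below:
        # all means exceed the threshold: move it to the minimum
        # similarity score on the fitted curves
        return min(c[1] for c in above)
    # mixed case: place the threshold between the maximum score of the
    # "below" curves and the minimum score of the "above" curves
    return 0.5 * (max(c[2] for c in below) + min(c[1] for c in above))
```

For example, with one curve entirely above the threshold the new value is that curve's minimum score; with one curve on each side, the new value is the midpoint of the gap between them.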
Fig. 4 shows five embodiments of adjusting the classification threshold. In frequency (or relative-frequency) distribution diagram a, the current classification threshold is x0. Only one Gaussian curve 410a is fitted in diagram a, and its mean is greater than the current classification threshold x0. Here, the classification threshold may be updated to the minimum similarity score on the Gaussian curve, i.e., to a1.
In distribution diagram b, the current classification threshold is x0. Two Gaussian curves 410b, 420b are fitted in diagram b, and the means of both curves are greater than the current classification threshold x0. Here, the classification threshold may be updated to a value slightly smaller than the minimum similarity score on the curves: the smallest similarity score is b2, and the classification threshold is updated to a value b1 slightly smaller than b2.
In distribution diagram c, the current classification threshold is x0. Three Gaussian curves 410c, 420c, 430c are fitted in diagram c; the means of curves 410c and 420c are smaller than the current classification threshold x0, and the mean of curve 430c is greater than x0. Here, the classification threshold may be updated to the average of the maximum similarity score c2 on curve 420c and the minimum similarity score c4 on curve 430c, i.e., c3 = 0.5 × (c2 + c4).
In distribution diagram d, the current classification threshold is x0. Two Gaussian curves 410d, 420d are fitted in diagram d; the mean of curve 410d is smaller than the current classification threshold x0, and the mean of curve 420d is greater than x0. Here, the classification threshold may be updated to the average of the mean d1 of curve 410d and the mean d5 of curve 420d, i.e., d3 = 0.5 × (d1 + d5).
In distribution diagram e, the current classification threshold is x0. Three Gaussian curves 410e, 420e, 430e are fitted in diagram e; the means of curves 410e and 420e are smaller than the current classification threshold x0, and the mean of curve 430e is greater than x0. Here, the classification threshold may be updated to the average of the maximum mean e1 among curves 410e, 420e and the mean e5 of curve 430e, i.e., e3 = 0.5 × (e1 + e5).
Fig. 5 illustrates a flow chart of a voiceprint recognition method 500 according to one embodiment of the present invention. The method 500 is performed in a server and includes the voiceprint recognition and threshold optimization processes shown in fig. 2; it illustrates the order and logic of voiceprint recognition and threshold optimization. As shown in fig. 5, the method 500 begins at step S510.
In step S510, a speech signal to be recognized is recognized to determine a similarity score for the speech signal.
In one embodiment, the speech signal to be recognized is collected by a voice device, which may be, for example, a smart speaker, but is not limited thereto. The similarity score of the speech signal may be determined according to Step1, which is not repeated here.
When the number of collected voice signals to be recognized is small (less than or equal to a number threshold), only step S540 is performed and steps S520 and S530 are not; that is, voiceprint recognition is performed on the collected voice signals without optimizing the classification threshold.
When the number of collected voice signals to be recognized is large (greater than the number threshold), the voiceprint recognition process of step S540 is executed in parallel with the threshold optimization processes of steps S520 and S530.
In step S540, the similarity score calculated in step S510 is classified according to a classification threshold to determine whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user. Note that the classification threshold adopted in step S540 is the current classification threshold, i.e., the threshold that has not yet been updated. If the similarity score is greater than the classification threshold, the voice signal to be recognized is judged to correspond to the same user as the pre-stored voiceprint; if the similarity score is smaller than or equal to the classification threshold, the voice signal is judged to correspond to a different user. The specific implementation of step S540 may refer to the foregoing description of Step2 and is not repeated here.
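The decision rule of step S540 reduces to a strict comparison against the current threshold; a minimal sketch (the function name is illustrative):

```python
def is_same_user(similarity_score, classification_threshold):
    """Scores strictly greater than the threshold are judged to come from
    the same user as the pre-stored voiceprint; scores equal to or smaller
    than the threshold are judged to come from a different user."""
    return similarity_score > classification_threshold
```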
In step S520, a distribution of a plurality of similarity scores is counted. The specific implementation process of step S520 may refer to the foregoing description of step S310, which is not repeated here.
In step S530, the classification threshold is adjusted according to the distribution of the similarity scores. The specific implementation process of step S530 may refer to the foregoing description of step S320, which is not repeated here. The new classification threshold value obtained after adjustment can be used for voiceprint recognition of the next voice signal to be recognized acquired by the voice equipment.
As can be seen from the method 500, for each voice signal newly collected by the voice device, voiceprint recognition is performed according to steps S510 and S540 to obtain a recognition result, i.e., whether the voice signal and the pre-stored voiceprint correspond to the same user. Each newly collected voice signal also increments the count of voice signals collected by the voice device by one. If this count is smaller than or equal to the number threshold, the threshold optimization steps S520 and S530 are not executed; if the count is greater than the number threshold, steps S520 and S530 are executed to update and optimize the classification threshold and determine a new one. The new classification threshold may then be used to perform voiceprint recognition on the next voice signal collected by the voice device.
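The control flow above can be sketched as a small class. The class name, the default number threshold, and the placeholder optimizer are assumptions of this sketch; the essential points it shows are that classification uses the current (not yet updated) threshold and that optimization only runs once enough scores have accumulated:

```python
class VoiceprintRecognizer:
    """Sketch of the method-500 loop (names and defaults are illustrative)."""

    def __init__(self, classification_threshold, number_threshold=100):
        self.threshold = classification_threshold
        self.number_threshold = number_threshold
        self.scores = []  # similarity scores of all collected signals

    def recognize(self, similarity_score):
        # step S510 has already produced the similarity score for the new signal
        self.scores.append(similarity_score)
        # step S540: classify with the CURRENT threshold
        same_user = similarity_score > self.threshold
        # steps S520/S530: optimize only once enough scores exist;
        # the updated threshold applies to the NEXT signal
        if len(self.scores) > self.number_threshold:
            self.threshold = self._optimize(self.scores)
        return same_user

    def _optimize(self, scores):
        # placeholder for the Gaussian-fitting adjustment of steps S520/S530
        return self.threshold

r = VoiceprintRecognizer(0.5, number_threshold=2)
results = [r.recognize(s) for s in (0.6, 0.4, 0.7)]
```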
Fig. 6 shows a schematic diagram of a voiceprint recognition process according to another embodiment of the present invention. The voiceprint recognition process shown in fig. 6 differs from fig. 2 only in that the voiceprint recognition process shown in fig. 6 is performed at the voice device 110, whereas the voiceprint recognition process shown in fig. 2 is performed at the server 120. The specific implementation of each step in fig. 6 may refer to the foregoing description related to fig. 2, and will not be repeated here.
Currently, voice devices are constrained by portability and aesthetics, with low hardware configurations and weak computing power, so most of the computation is performed on the server side, which has strong computing power, as shown in fig. 2. If future voice devices gain stronger hardware and greater computing power, the computation may be completed on the voice device itself, as shown in fig. 6.
FIG. 7 shows a schematic diagram of a computing device 700 according to one embodiment of the invention. As shown in FIG. 7, in a basic configuration 702, a computing device 700 typically includes a system memory 706 and one or more processors 704. A memory bus 708 may be used for communication between the processor 704 and the system memory 706.
Depending on the desired configuration, the processor 704 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 704 may include one or more levels of cache, such as a first level cache 710 and a second level cache 712, a processor core 714, and registers 716. An example processor core 714 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 718 may be used with the processor 704, or in some implementations the memory controller 718 may be an internal part of the processor 704.
Depending on the desired configuration, system memory 706 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 706 may include an operating system 720, one or more applications 722, and program data 724. The application 722 is in effect a number of program instructions for instructing the processor 704 to perform a corresponding operation. In some implementations, the application 722 may be arranged to cause the processor 704 to operate with program data 724 on an operating system.
Computing device 700 may also include an interface bus 740 that facilitates communication from various interface devices (e.g., output devices 742, peripheral interfaces 744, and communication devices 746) to the basic configuration 702 via a bus/interface controller 730. The example output devices 742 include a graphics processing unit 748 and an audio processing unit 750. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 752. Example peripheral interfaces 744 can include a serial interface controller 754 and a parallel interface controller 756, which can be configured to facilitate communication via one or more I/O ports 758 and external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.). Example communication devices 746 may include a network controller 760 that may be arranged to facilitate communications with one or more other computing devices 762 over network communication links via one or more communication ports 764.
The network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In a computing device 700 according to the present invention, an application 722 may include, for example, a voiceprint recognition apparatus 800, the apparatus 800 including a plurality of program instructions which may instruct a processor 704 to perform the voiceprint recognition method 300 of the present invention. Computing device 700 may be embodied as a voice device (e.g., a smart speaker, a smart television, etc.) or a server, but is not limited thereto.
Fig. 8 shows a schematic diagram of a voiceprint recognition device 800 in accordance with one embodiment of the present invention. The apparatus 800 may reside in a voice device (e.g., the voice device 110 described above) or a server (e.g., the server 120 described above) for performing the voiceprint recognition method 300 of the present invention. As shown in fig. 8, the voiceprint recognition apparatus 800 includes a statistics module 810 and a threshold optimization module 820.
The statistics module 810 is adapted to count a distribution of a plurality of similarity scores, wherein the similarity scores are used for representing the similarity between the voice signal to be identified and the pre-stored voiceprints. The statistics module 810 is specifically configured to perform the method as described in the foregoing step S310, and the processing logic and functions of the statistics module 810 can be referred to the relevant description of the foregoing step S310, which is not repeated herein.
The threshold optimization module 820 is adapted to adjust a classification threshold according to the distribution of the similarity scores, where the classification threshold is used to classify the similarity scores so as to determine whether the voice signal to be identified and the pre-stored voiceprint correspond to the same user. The threshold optimization module 820 is specifically configured to perform the method as described in the foregoing step S320, and the processing logic and functions of the threshold optimization module 820 can be referred to the relevant description of the foregoing step S320, which is not repeated herein.
According to one embodiment, the apparatus 800 further comprises a voiceprint recognition module 830. The voiceprint recognition module 830 is adapted to recognize a speech signal to be recognized to determine a similarity score for the speech signal. The voiceprint recognition module 830 is specifically configured to perform the method of steps Step1 to Step3, and the processing logic and functions of the voiceprint recognition module 830 can be referred to the relevant descriptions of steps Step1 to Step3, which are not repeated here.
The various techniques described herein may be implemented in connection with hardware or software or a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy diskettes, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to perform the voiceprint recognition method of the present invention in accordance with instructions in the program code stored in the memory.
By way of example, and not limitation, readable media comprise readable storage media and communication media. The readable storage medium stores information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with examples of the invention. The structure required to construct such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided to disclose enablement and the best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as methods or combinations of method elements that may be implemented by a processor of a computer system or by other means of performing the functions. Thus, a processor with the necessary instructions for implementing the described method or method element forms a means for implementing the method or method element. Furthermore, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is for carrying out the functions performed by the elements for carrying out the objects of the invention.
As used herein, unless otherwise specified, the use of the ordinal terms "first," "second," "third," etc., to describe a common object merely denotes different instances of like objects and is not intended to imply that the objects so described must be in a given order, whether temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above description, will appreciate that other embodiments are contemplated within the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is defined by the appended claims.

Claims (12)

1. A voiceprint recognition method comprising:
counting distribution conditions of a plurality of similarity scores, wherein the similarity scores are used for representing similarity between a voice signal to be identified and a prestored voiceprint;
adjusting a classification threshold according to the distribution condition, wherein the classification threshold is used for classifying similarity scores to judge whether the voice signal to be recognized and the pre-stored voiceprint correspond to the same user,
wherein, the step of counting the distribution of the plurality of similarity scores includes:
determining a frequency distribution diagram or a relative-frequency distribution diagram of the plurality of similarity scores;
performing Gaussian curve fitting on the frequency distribution diagram or the relative-frequency distribution diagram to obtain at least one Gaussian curve;
said adjusting the classification threshold according to said distribution comprises: if the mean values of all the Gaussian curves are larger than the current classification threshold, updating the classification threshold to a value whose absolute difference from the minimum similarity score on the Gaussian curves is within a preset range.
2. The method of claim 1, wherein the speech signal to be identified is collected by a speech device, and the step of counting a distribution of a plurality of similarity scores comprises:
When the number of the collected voice signals to be recognized is larger than a number threshold value, the distribution situation of similarity scores of a plurality of voice signals to be recognized collected by the voice equipment is counted.
3. The method of claim 2, wherein the voice device comprises a smart speaker.
4. The method of claim 1, wherein said adjusting the classification threshold according to the distribution condition further comprises:
if at least one of the gaussian curves has a mean value smaller than or equal to the current classification threshold, updating the classification threshold to a value between a minimum similarity score on a first gaussian curve and a maximum similarity score on a second gaussian curve, wherein the first gaussian curve is a gaussian curve with a mean value larger than the current classification threshold, and the second gaussian curve is a gaussian curve with a mean value smaller than or equal to the current classification threshold.
5. The method of claim 4, wherein updating the classification threshold to a value whose absolute difference from the minimum similarity score on the Gaussian curve is within a preset range comprises:
updating the classification threshold to a minimum similarity score on the gaussian curve.
6. The method of claim 4, wherein the step of updating the classification threshold to a value between a minimum similarity score on a first gaussian curve and a maximum similarity score on a second gaussian curve comprises:
updating the classification threshold to be an average value of the minimum similarity score on the first Gaussian curve and the maximum similarity score on the second Gaussian curve; or
updating the classification threshold to be an average value of the minimum mean of the first Gaussian curve and the maximum mean of the second Gaussian curve.
7. The method of claim 1, wherein the classifying the similarity score to determine whether the speech signal to be recognized corresponds to the same user as the pre-stored voiceprint comprises:
when the similarity score is larger than the classification threshold, judging that the voice signal to be recognized and the prestored voiceprint correspond to the same user;
and when the similarity score is smaller than or equal to the classification threshold value, judging that the voice signal to be recognized and the prestored voiceprint correspond to different users.
8. The method of claim 1, wherein prior to the step of counting the distribution of the plurality of similarity scores, further comprising:
And identifying the voice signals to be identified to determine similarity scores of the voice signals.
9. The method of claim 8, wherein the step of identifying the speech signal to be identified to determine a similarity score for the speech signal comprises:
extracting voice characteristics of the voice signal;
comparing the voice characteristics with pre-stored voiceprints to determine similarity scores of the voice signals;
and storing the similarity score of the voice signal in association with the identification of the voice equipment for collecting the voice signal.
10. The method of claim 8, wherein the step of identifying the speech signal to be identified to determine a similarity score for the speech signal comprises:
extracting voice characteristics of the voice signal;
when a plurality of pre-stored voiceprints exist, the voice features are respectively compared with each pre-stored voiceprint so as to obtain a plurality of similarity scores;
and storing the maximum value of the similarity scores in association with the identification of the voice equipment for collecting the voice signals.
11. A voiceprint recognition apparatus comprising:
the statistics module is adapted to determine a frequency distribution diagram or a relative-frequency distribution diagram of a plurality of similarity scores, and to perform Gaussian curve fitting on the diagram to obtain at least one Gaussian curve, wherein the similarity scores are used for representing the similarity between a voice signal to be identified and a pre-stored voiceprint;
The threshold optimization module is adapted to adjust a classification threshold according to the distribution condition, wherein the classification threshold is used for classifying similarity scores so as to judge whether a voice signal to be recognized and a pre-stored voiceprint correspond to the same user, and the adjusting the classification threshold according to the distribution condition comprises: if the mean values of all the Gaussian curves are larger than the current classification threshold, updating the classification threshold to a value whose absolute difference from the minimum similarity score on the Gaussian curves is within a preset range; if at least one of the Gaussian curves has a mean smaller than or equal to the current classification threshold, updating the classification threshold to a value between the minimum similarity score on a first Gaussian curve and the maximum similarity score on a second Gaussian curve, wherein the first Gaussian curve is a Gaussian curve whose mean is larger than the current classification threshold, and the second Gaussian curve is a Gaussian curve whose mean is smaller than or equal to the current classification threshold.
12. A smart sound box adapted to perform the voiceprint recognition method of any one of claims 1-10.
CN201811378714.0A 2018-11-19 2018-11-19 Voiceprint recognition method and voiceprint recognition device Active CN111199729B (en)

Publications (2)

Publication Number Publication Date
CN111199729A CN111199729A (en) 2020-05-26
CN111199729B true CN111199729B (en) 2023-09-26


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185347A (en) * 2020-09-27 2021-01-05 北京达佳互联信息技术有限公司 Language identification method, language identification device, server and storage medium
US11756555B2 (en) * 2021-05-06 2023-09-12 Nice Ltd. Biometric authentication through voice print categorization using artificial intelligence
WO2022236453A1 (en) * 2021-05-08 2022-11-17 腾讯音乐娱乐科技(深圳)有限公司 Voiceprint recognition method, singer authentication method, electronic device and storage medium
CN113327617B (en) * 2021-05-17 2024-04-19 西安讯飞超脑信息科技有限公司 Voiceprint discrimination method, voiceprint discrimination device, computer device and storage medium
CN113327618B (en) * 2021-05-17 2024-04-19 西安讯飞超脑信息科技有限公司 Voiceprint discrimination method, voiceprint discrimination device, computer device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1202228A1 (en) * 2000-10-17 2002-05-02 Varette Limited A user authentication system and process
CN101887722A (en) * 2009-06-18 2010-11-17 博石金(北京)信息技术有限公司 Rapid voiceprint authentication method
CN102254551A (en) * 2010-05-20 2011-11-23 盛乐信息技术(上海)有限公司 Voiceprint authentication apparatus
CN102404278A (en) * 2010-09-08 2012-04-04 盛乐信息技术(上海)有限公司 Song request system based on voiceprint recognition and application method thereof
CN105468955A (en) * 2015-12-28 2016-04-06 深圳市亚略特生物识别科技有限公司 Mobile terminal and electronic system based on biological recognition
JP2016053599A (en) * 2014-09-02 2016-04-14 株式会社Kddiテクノロジー Communication device, method and program for updating criterion for determining voice print data
WO2017215558A1 (en) * 2016-06-12 2017-12-21 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
WO2018191782A1 (en) * 2017-04-19 2018-10-25 Auraya Pty Ltd Voice authentication system and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086461B2 (en) * 2007-06-13 2011-12-27 At&T Intellectual Property Ii, L.P. System and method for tracking persons of interest via voiceprint
US8209174B2 (en) * 2009-04-17 2012-06-26 Saudi Arabian Oil Company Speaker verification system
US9042867B2 (en) * 2012-02-24 2015-05-26 Agnitio S.L. System and method for speaker recognition on mobile devices
CN104821934B (en) * 2015-03-20 2018-11-20 百度在线网络技术(北京)有限公司 Vocal print login method and device based on artificial intelligence
US10468032B2 (en) * 2017-04-10 2019-11-05 Intel Corporation Method and system of speaker recognition using context aware confidence modeling

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Honglei Ren et al. Secure Smart Home: A Voiceprint and Internet Based Authentication System for Remote Accessing. The 11th International Conference on Computer Science & Education (ICCSE 2016). 2016. *
Li Qiuhua et al. Network account identity authentication system based on voiceprint recognition and speech recognition. Netinfo Security. 2013, (12). *
Li Xiaohan et al. Speaker verification based on HMM-UBM and short utterances. Information and Control. 2004, (06). *
Gao Xinjian et al. A new method of score normalization in speaker verification. Journal of Computer Applications. 2007, (10). *

Also Published As

Publication number Publication date
CN111199729A (en) 2020-05-26

Similar Documents

Publication Publication Date Title
CN111199729B (en) Voiceprint recognition method and voiceprint recognition device
CN111699528B (en) Electronic device and method for executing functions of electronic device
CN105556920B (en) Method and apparatus for controlling the access to application program
CN109643549B (en) Speech recognition method and device based on speaker recognition
CN107342076B (en) Intelligent home control system and method compatible with abnormal voice
US11734326B2 (en) Profile disambiguation
CN105654949B (en) A kind of voice awakening method and device
US9412361B1 (en) Configuring system operation using image data
US10715604B1 (en) Remote system processing based on a previously identified user
US11545174B2 (en) Emotion detection using speaker baseline
US20190235831A1 (en) User input processing restriction in a speech processing system
CN112053683A (en) Voice instruction processing method, device and control system
US11727939B2 (en) Voice-controlled management of user profiles
US20200135212A1 (en) Speech recognition method and apparatus in environment including plurality of apparatuses
US20240029739A1 (en) Sensitive data control
TW202018696A (en) Voice recognition method and device and computing device
US20220019178A1 (en) Intelligent device and method for controlling the same
CN111968644A (en) Intelligent device awakening method and device and electronic device
CN110896352B (en) Identity recognition method, device and system
CN111816184B (en) Speaker recognition method, speaker recognition device, and recording medium
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
US20200243088A1 (en) Interaction system, interaction method, and program
CN111669350A (en) Identity verification method, verification information generation method, payment method and payment device
CN111199742A (en) Identity verification method and device and computing equipment
CN113948089A (en) Voiceprint model training and voiceprint recognition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030617

Country of ref document: HK

GR01 Patent grant