CN113366567A - Voiceprint identification method, singer authentication method, electronic equipment and storage medium - Google Patents

Voiceprint identification method, singer authentication method, electronic equipment and storage medium

Info

Publication number
CN113366567A
Authority
CN
China
Prior art keywords
audio
voiceprint
user
similarity
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180001166.3A
Other languages
Chinese (zh)
Inventor
胡诗超
陈灏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Publication of CN113366567A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata automatically derived from the content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voiceprint recognition method, a singer authentication method, an electronic device, and a storage medium. The voiceprint recognition method comprises: receiving user audio and determining a target audio corresponding to the user audio; determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of a plurality of reference audios; constructing a similarity distribution model from the reference voiceprint similarities, and determining the distribution position of the user voiceprint similarity in the similarity distribution model; and judging, according to the distribution position, whether the voiceprint of the user audio matches that of the target audio. Voiceprint matching is thus judged against a dynamic, population-based standard, which improves the accuracy of voiceprint recognition.

Description

Voiceprint identification method, singer authentication method, electronic equipment and storage medium
Technical Field
The present application relates to the field of biometric identification technologies, and in particular, to a voiceprint identification method, a singer authentication method, an electronic device, and a storage medium.
Background
A voiceprint is the sound-wave spectrum, carrying speech information, that an electro-acoustic instrument displays; voiceprint recognition can judge whether several audio recordings were input by the same person. Voiceprint recognition is now widely applied in scenarios such as device unlocking, bank transactions, and singer authentication.
In a typical voiceprint recognition process, the user audio actually collected and the target audio against which it must be compared are first determined; the voiceprint features of the two are compared to obtain a voiceprint feature similarity; and a fixed threshold is used to judge whether the voiceprints match. However, because the distribution of voiceprint similarity is not uniform across a population, a fixed threshold is a poor standard for evaluating whether voiceprints match.
Therefore, how to improve the accuracy of voiceprint recognition is a technical problem that needs to be solved by those skilled in the art at present.
Disclosure of Invention
An object of the present application is to provide a voiceprint recognition method, a singer authentication method, an electronic device, and a storage medium, which can improve the accuracy of voiceprint recognition.
In order to solve the above technical problem, the present application provides a voiceprint recognition method, including:
receiving user audio and determining a target audio corresponding to the user audio;
determining user voiceprint similarity of the target audio and the user audio, and reference voiceprint similarity of the target audio and each reference audio in a plurality of reference audios;
constructing a similarity distribution model according to the reference voiceprint similarity between the target audio and each of the plurality of reference audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model;
and judging, according to the distribution position, whether the voiceprint of the user audio matches that of the target audio.
Optionally, constructing a similarity distribution model according to the reference voiceprint similarity between the target audio and each of the multiple reference audios, and determining a distribution position of the user voiceprint similarity in the similarity distribution model, including:
constructing a Gaussian distribution function according to the mean value and the variance of the similarity of the target audio and the reference voiceprint of each of the multiple reference audios;
and calculating an upper cumulative distribution function value of the user voiceprint similarity in the Gaussian distribution function, and determining the distribution position of the user voiceprint similarity in the Gaussian distribution function according to the upper cumulative distribution function value.
Optionally, determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of the plurality of reference audios respectively includes:
determining a user voiceprint feature vector of the user audio, a target voiceprint feature vector of the target audio, and a reference voiceprint feature vector of each of a plurality of reference audios;
calculating the user voiceprint similarity according to the user voiceprint feature vector and the target voiceprint feature vector;
and calculating the reference voiceprint similarity according to the target voiceprint feature vector and the reference voiceprint feature vector.
Optionally, calculating the user voiceprint similarity according to the user voiceprint feature vector and the target voiceprint feature vector, including:
calculating the user voiceprint similarity according to the cosine distance between the user voiceprint feature vector and the target voiceprint feature vector;
correspondingly, calculating the reference voiceprint similarity according to the target voiceprint feature vector and the reference voiceprint feature vector, including:
and calculating the reference voiceprint similarity according to the cosine distance between the target voiceprint feature vector and the reference voiceprint feature vector.
Optionally, after receiving the user audio, the method further includes:
judging whether the user audio meets a preset condition; the preset condition comprises any one, or a combination of any several, of a clarity constraint condition, a duration constraint condition and an audio type constraint condition;
if yes, executing the operation of determining the target audio corresponding to the user audio;
if not, returning the prompt information of audio input failure and re-receiving the user audio.
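As an illustration only (not part of the claims), the preset-condition check above might be sketched as follows; the concrete thresholds, the use of signal-to-noise ratio as a stand-in for the clarity constraint, and the allowed audio types are all assumptions:

```python
def passes_preset_conditions(duration_s: float, snr_db: float, audio_type: str) -> bool:
    """Hypothetical pre-check combining the three constraint types.
    Thresholds and the SNR proxy for 'clarity' are illustrative assumptions."""
    if duration_s < 10.0:                         # duration constraint
        return False
    if snr_db < 15.0:                             # clarity constraint (SNR proxy)
        return False
    if audio_type not in {"singing", "speech"}:   # audio type constraint
        return False
    return True

# A too-short or too-noisy clip would trigger the "audio input failure" prompt.
print(passes_preset_conditions(30.0, 25.0, "singing"))  # True
print(passes_preset_conditions(5.0, 25.0, "singing"))   # False
```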
Optionally, determining the target audio corresponding to the user audio includes:
receiving an authentication request of a user, and determining a target authentication singer corresponding to the authentication request;
and determining the target audio according to the musical composition of the target authentication singer in the database.
Optionally, determining the target audio according to the musical composition of the target authenticated singer in the database includes:
determining a music track corresponding to the target audio;
and querying the database for the target authenticated singer's recording of that track, and determining the target audio according to that musical composition.
Optionally, determining the target audio according to the musical composition of the target authenticated singer in the database comprises:
and carrying out sound source separation on the musical composition of the target authentication singer in the database, and taking the voice obtained by sound source separation as the target audio.
Optionally, determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of the plurality of reference audios respectively includes:
calculating the user voiceprint similarity of the user audio and the target audio;
and determining a plurality of reference audios according to the musical compositions of N singers in the database, and calculating the reference voiceprint similarity of the target audio and each reference audio.
Optionally, determining whether the user audio and the target audio are voiceprint matched according to the distribution position includes:
judging whether the distribution position is within a preset position interval or not;
if yes, judging that the voiceprint of the user audio is matched with the voiceprint of the target audio;
if not, judging that the voiceprints of the user audio and the target audio are not matched.
Optionally, determining whether the user audio and the target audio are voiceprint matched according to the distribution position includes:
performing a weighted calculation on the distribution position and the user voiceprint similarity to obtain a comprehensive similarity score;
judging whether the comprehensive similarity score is greater than a preset score or not;
if yes, judging that the voiceprint of the user audio is matched with the voiceprint of the target audio;
if not, judging that the voiceprints of the user audio and the target audio are not matched.
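The weighted combination described above can be sketched as follows; the weights, the preset score, and the convention that the distribution position is expressed as a percentile rank in [0, 1] (higher meaning the user ranks higher in the reference population) are assumptions for illustration, not values from the application:

```python
def composite_score(distribution_position: float, user_similarity: float,
                    w_pos: float = 0.6, w_sim: float = 0.4) -> float:
    """Weighted combination of the distribution position (assumed here to be
    a percentile rank in [0, 1]) and the raw user voiceprint similarity."""
    return w_pos * distribution_position + w_sim * user_similarity

def is_voiceprint_match(distribution_position: float, user_similarity: float,
                        preset_score: float = 0.7) -> bool:
    # Match is declared only when the composite score exceeds the preset score.
    return composite_score(distribution_position, user_similarity) > preset_score

print(is_voiceprint_match(0.95, 0.90))  # True: high rank and high raw similarity
print(is_voiceprint_match(0.10, 0.20))  # False
```

Blending the population rank with the raw similarity guards against edge cases where one signal alone is misleading, e.g. a high raw similarity against a target voice that resembles almost everyone.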
The application also provides a singer authentication method, which comprises the following steps:
receiving a singer authentication request of a target user, determining a target authentication singer corresponding to the singer authentication request, and inquiring singer singing audio of the target authentication singer;
receiving user singing audio uploaded by the target user;
determining the user voiceprint similarity of the singer singing audio and the user singing audio, and the reference voiceprint similarity of the singer singing audio and each reference singing audio in a plurality of reference singing audios;
constructing a similarity distribution model according to the reference voiceprint similarity between the singer singing audio and each of the plurality of reference singing audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model;
judging, according to the distribution position, whether the voiceprint of the user singing audio matches that of the singer singing audio; and if the voiceprints match, judging that the target user passes singer authentication.
The application also provides a storage medium on which a computer program is stored, which when executed implements the steps performed by the above voiceprint recognition method.
The application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps executed by the voiceprint recognition method when calling the computer program in the memory.
The present application provides a voiceprint recognition method comprising: receiving user audio and determining a target audio corresponding to the user audio; determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of a plurality of reference audios; constructing a similarity distribution model from the reference voiceprint similarities and determining the distribution position of the user voiceprint similarity in that model; and judging, according to the distribution position, whether the voiceprint of the user audio matches that of the target audio.
After receiving the user audio, the present application determines not only the user voiceprint similarity between the user audio and the target audio, but also the reference voiceprint similarity between each reference audio and the target audio. Because vocal ranges and timbres vary greatly from person to person, different target audios give rise to different voiceprint-similarity distributions across the population. A similarity distribution model is therefore constructed from the reference voiceprint similarity between the target audio and each of the plurality of reference audios, and whether the user audio matches the target audio by voiceprint is judged from the distribution position of the user voiceprint similarity in that model. Compared with conventional schemes that evaluate voiceprint similarity entirely against a fixed threshold, the present application uses the position of the user voiceprint similarity among all reference voiceprint similarities to reflect the probability that the user audio matches the target audio, so that voiceprint matching is judged against a dynamic standard and the accuracy of voiceprint recognition is improved. The present application also provides a singer authentication method, an electronic device, and a storage medium with the same beneficial effects, which are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is an architecture diagram of a voiceprint recognition system provided in an embodiment of the present application;
fig. 2 is a flowchart of a voiceprint recognition method according to an embodiment of the present application;
fig. 3 is a flowchart of a method for determining distribution positions of user voiceprint similarity according to an embodiment of the present disclosure;
FIG. 4 is a Gaussian distribution model of reference audio provided by an embodiment of the present application;
fig. 5 is a flowchart of a method for determining matching similarity information according to an embodiment of the present disclosure;
fig. 6 is a flowchart of an audio preprocessing method according to an embodiment of the present application;
FIG. 7 is a flowchart of a singer authentication method according to an embodiment of the present application;
FIG. 8 is a schematic product-side interaction diagram illustrating a singer authentication method based on reference crowd probability according to an embodiment of the present application;
FIG. 9 is a flowchart of a singer authentication method based on reference crowd probability according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a voiceprint recognition algorithm based on a reference population according to an embodiment of the present application;
fig. 11 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a typical voiceprint recognition process, the user audio actually collected and the target audio against which it must be compared are determined, the voiceprint features of the two are compared to obtain a voiceprint feature similarity, and a fixed threshold is used to judge whether the voiceprints match. However, the distribution of voiceprint similarity is not uniform across a population. For example, if singer A's vocal range and timbre resemble those of most people, voiceprint matching might only be safe to declare when the feature similarity reaches 90%; if singer B's vocal range and timbre differ from those of most people, matching might already be reliable at 70%. Different target audios thus call for different standards for judging whether voiceprints match, and a scheme that judges matching against a fixed threshold yields low voiceprint identification accuracy. To overcome these defects in the voiceprint recognition process, the present application provides the following implementations, which can improve voiceprint recognition accuracy.
To aid understanding of the voiceprint recognition method provided in the present application, the system in which it is used is described first. Referring to fig. 1, fig. 1 is an architecture diagram of a voiceprint recognition system provided in an embodiment of the present application. The system includes a client 101, a computing device 102, and a database 103. A user may transmit user audio to the computing device 102 through the client 101; after receiving the user audio, the computing device 102 sends an audio acquisition request to the database 103 to obtain the target audio to be compared against and the reference audios used to evaluate the voiceprint similarity between the user audio and the target audio. The computing device may then calculate the ranking probability of the user audio relative to the target audio within the population, and judge from that ranking probability whether the voiceprints of the user audio and the target audio match.
Referring to fig. 2, fig. 2 is a flowchart of a voiceprint recognition method provided in the embodiment of the present application, and the specific steps may include:
S201: receiving user audio and determining target audio corresponding to the user audio;
the embodiment can be applied to electronic equipment such as a smart phone, a personal computer or a server, the electronic equipment can be provided with a voice input module and receive user audio input by a user in real time by using the voice input module, and the electronic equipment can also be connected with other equipment in a wired or wireless manner and receive the user audio transmitted by the other equipment.
The user audio is the audio of a user needing voiceprint recognition, and the target audio is the audio needing voiceprint characteristic comparison with the user audio. The target audio may be set according to an application scenario of the embodiment, for example, in the process of a bank transaction, the user audio is the sound content of the transactor, and the target audio is the sound content of the creator when the bank account is created; for example, in the process of applying for authentication by a singer, the user audio is the voice content of the authentication requester, and the target audio is the song content of the singer requested to be authenticated.
As one possible implementation, before determining the target audio corresponding to the user audio, this embodiment may further acquire a user authentication request and determine the target audio by parsing that request. As another possible implementation, this embodiment may determine the target audio from the content of the user audio itself: for example, during a singer's application for authentication, the user audio is a song sung live by the user, from which the sung track can be identified and the corresponding target audio determined.
S202: determining user voiceprint similarity of the target audio and the user audio, and reference voiceprint similarity of the target audio and each reference audio in a plurality of reference audios;
before this step, there may be an operation of randomly acquiring a reference audio from the database, where the reference audio may be any audio different from the target audio. In order to improve the accuracy of voiceprint recognition, the present embodiment may limit the number of the reference audios to be not less than a preset number, so as to respectively calculate the reference voiceprint similarity between each reference audio and the target audio.
S203: constructing a similarity distribution model according to the reference voiceprint similarity between the target audio and each of the plurality of reference audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model;
S204: judging, according to the distribution position, whether the voiceprint of the user audio matches that of the target audio.
The reference voiceprint similarity between a reference audio and the target audio reflects how similar other people in the population sound to the person who input the target audio. The distribution position of the user voiceprint similarity within the similarity distribution model reflects the ranking of the user voiceprint similarity within the population. Specifically, a similarity distribution model (such as a Gaussian model) can be fitted to the reference voiceprint similarity values of the reference population, and the ranking probability determined from the distribution position of the user voiceprint similarity in that model: the higher the ranking, the higher the probability that the user audio matches the target audio by voiceprint. The distribution position can be expressed by the Upper Cumulative Distribution (UCD) value.
On the basis of the distribution position of the user voiceprint similarity in the similarity distribution model, this embodiment can judge whether the distribution position falls within a preset position interval. If it does, the voiceprint of the user audio is judged to match the voiceprint of the target audio, that is, the target audio and the user audio were input by the same user; if it does not, the voiceprints are judged not to match, that is, the target audio and the user audio were input by different users.
After receiving the user audio, this embodiment determines the user voiceprint similarity between the user audio and the target audio, and also determines the reference voiceprint similarity between each reference audio and the target audio. Because vocal ranges and timbres vary greatly from person to person, different target audios give rise to different voiceprint-similarity distributions across the population. In this embodiment, a similarity distribution model is constructed from the reference voiceprint similarity between the target audio and each of the plurality of reference audios, and whether the user audio matches the target audio by voiceprint is judged from the distribution position of the user voiceprint similarity in that model. Compared with conventional schemes that evaluate voiceprint similarity entirely against a fixed threshold, this embodiment uses the position of the user voiceprint similarity among all reference voiceprint similarities to reflect the matching probability of the user audio and the target audio, judges matching against a dynamic standard, and improves the accuracy of voiceprint recognition.
Referring to fig. 3, fig. 3 is a flowchart of a method for determining a distribution position of a voiceprint similarity of a user according to an embodiment of the present application, where this embodiment is a further description of S203 in the embodiment corresponding to fig. 2, and a further implementation manner can be obtained by combining this embodiment with the embodiment corresponding to fig. 2, where this embodiment may include the following steps:
s301: constructing a Gaussian distribution function according to the mean value and the variance of the similarity of the target audio and the reference voiceprint of each of the multiple reference audios;
in this embodiment, a similarity set may be constructed according to all the reference voiceprint similarities, a mean and a variance of the similarity set are determined, and a gaussian distribution function is constructed based on the mean and the variance.
S302: calculating an upper accumulated distribution function value of the user voiceprint similarity in the Gaussian distribution function;
s303: determining the distribution position of the user voiceprint similarity in the Gaussian distribution function according to the accumulated distribution function value;
on the basis of obtaining the gaussian distribution function, the present embodiment calculates an upper cumulative distribution function value of the user voiceprint similarity in the gaussian distribution function, where the upper cumulative distribution function value is used to describe the head position proportion of the voiceprint similarity between the user audio and the target audio in all reference audio, and may determine, according to the upper cumulative distribution function value, the proportion of the reference audio that the voiceprint similarity between the reference audio and the target audio is not better than the user audio.
Referring to fig. 4, fig. 4 is a Gaussian distribution model of the reference audios provided in an embodiment of the present application. Point P in fig. 4 is the position of the similarity between the user audio and the target audio within the Gaussian distribution function, and the dotted region is the upper cumulative distribution function value of the user voiceprint similarity. The Y axis in fig. 4 is the probability density of the random variable X, and the X axis represents the random variable (the similarity value). In this way, the distribution position of the user voiceprint similarity can be computed efficiently and accurately, improving both the efficiency and the accuracy of voiceprint recognition.
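A minimal Python sketch of S301-S303, using only the standard library: a Gaussian is fitted to the reference voiceprint similarities via their mean and variance, and the upper cumulative distribution value P(X >= user similarity) is computed with the error function. The sample similarity values are invented for illustration:

```python
import math

def upper_cdf(user_sim: float, ref_sims: list) -> float:
    """Fit N(mean, var) to the reference similarities and return the
    upper cumulative distribution value P(X >= user_sim)."""
    n = len(ref_sims)
    mean = sum(ref_sims) / n
    var = sum((s - mean) ** 2 for s in ref_sims) / n
    z = (user_sim - mean) / (math.sqrt(var) * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))  # survival function of the Gaussian

# Invented reference population: similarities of other singers to the target audio.
ref = [0.42, 0.55, 0.48, 0.51, 0.60, 0.45, 0.53, 0.49]
# A user similarity far above the population leaves almost no upper tail,
# i.e. the user ranks at the very top of the distribution.
print(upper_cdf(0.90, ref))
```

A small upper-CDF value means that few members of the reference population are as similar to the target voice as the user is, which is the dynamic, population-relative evidence of a match described above.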
Referring to fig. 5, fig. 5 is a flowchart of a method for determining matching similarity information according to an embodiment of the present application, where the embodiment is further described with reference to S202 in the embodiment corresponding to fig. 2, and a further implementation manner may be obtained by combining the embodiment with the embodiment corresponding to fig. 2, where the embodiment may include the following steps:
S501: determining a user voiceprint feature vector of the user audio, a target voiceprint feature vector of the target audio, and a reference voiceprint feature vector of each of a plurality of reference audios;
S502: calculating the user voiceprint similarity according to the user voiceprint feature vector and the target voiceprint feature vector;
S503: calculating the reference voiceprint similarity according to the target voiceprint feature vector and the reference voiceprint feature vector.
In the above embodiment, the voiceprint feature vector of the audio may be determined in various ways; for example, the user voiceprint feature vector, the target voiceprint feature vector and the reference voiceprint feature vector may be calculated based on a neural network embedding method, or based on a statistical signal processing method.
Further, the present embodiment may determine the voiceprint similarity according to the distance between the user voiceprint feature vector, the target voiceprint feature vector, and the reference voiceprint feature vector. As a feasible implementation manner, the present embodiment may calculate the user voiceprint similarity according to the cosine distance between the user voiceprint feature vector and the target voiceprint feature vector, and may calculate the reference voiceprint similarity according to the cosine distance between the user voiceprint feature vector and the reference voiceprint feature vector. By the method, the accuracy of the user voiceprint similarity and the accuracy of the reference voiceprint similarity can be improved, and high-precision voiceprint recognition is further realized.
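A minimal sketch of the cosine-based similarity described above (the vector values and names are illustrative only; cosine similarity is the cosine of the angle between the two feature vectors, so vectors pointing in the same direction score close to 1):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two voiceprint feature vectors:
    # dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

user_vec = [0.3, 0.8, 0.5]       # hypothetical user voiceprint embedding
target_vec = [0.35, 0.75, 0.55]  # hypothetical target voiceprint embedding
sim = cosine_similarity(user_vec, target_vec)
assert 0.99 < sim <= 1.0  # nearly parallel vectors, similarity close to 1
```

The same function computes the reference voiceprint similarity by pairing the user voiceprint feature vector with each reference voiceprint feature vector.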
Referring to fig. 6, fig. 6 is a flowchart of an audio preprocessing method provided in an embodiment of the present application, where this embodiment is a further supplementary description of the embodiment corresponding to fig. 2 after receiving a user audio, and a further implementation manner can be obtained by combining this embodiment with the embodiment corresponding to fig. 2, where this embodiment may include the following steps:
S601: judging whether the user audio meets a preset condition; if yes, executing step S602; if not, executing step S603;
wherein the preset condition comprises any one or a combination of a definition constraint condition, a duration constraint condition and an audio type constraint condition. Specifically, if there is no obvious noise or other irrelevant signal in the user audio, it can be determined that the user audio meets the definition constraint condition; if the duration of the user audio is within a preset duration interval, it is determined that the user audio meets the duration constraint condition; if the user audio is a dry vocal, it can be determined that the user audio meets the audio type constraint condition.
S602: executing the operation of determining the target audio corresponding to the user audio;
S603: returning prompt information of audio input failure and re-receiving the user audio.
In this embodiment, if the user audio meets the preset condition, the operation of determining the target audio corresponding to the user audio may be continuously performed so as to perform the related operations of S201 to S204; if the user audio does not meet the preset condition, the operation of determining the target audio is not executed, and prompt information of audio input failure is returned so as to prompt the user to re-input the audio. Invalid audio can be filtered through the audio preprocessing operation, and the power consumption of the voiceprint recognition device is reduced.
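The preset-condition check of S601 to S603 can be sketched as follows. The 7 s lower bound follows the example threshold given later in this application; the 60 s upper bound and the function interface are assumptions for illustration only:

```python
def check_user_audio(duration_s, has_noise, is_dry_vocal,
                     min_s=7.0, max_s=60.0):
    # Returns (ok, message): ok is False when any preset condition
    # (definition, duration, or audio type) is violated.
    if has_noise:
        return False, "audio input failed: obvious noise detected"
    if not (min_s <= duration_s <= max_s):
        return False, "audio input failed: duration out of range"
    if not is_dry_vocal:
        return False, "audio input failed: dry vocal required"
    return True, "ok"

# A clean 30 s dry vocal passes; a 3 s clip fails the duration check.
assert check_user_audio(30.0, False, True) == (True, "ok")
assert check_user_audio(3.0, False, True)[0] is False
```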
Further, as a further introduction to the above embodiment, it may also be determined whether the user audio and the target audio are voiceprint matched by: carrying out weighted calculation on the distribution positions and the user voiceprint similarity to obtain a comprehensive similarity score; judging whether the comprehensive similarity score is greater than a preset score or not; if yes, judging that the voiceprint of the user audio is matched with the voiceprint of the target audio; if not, judging that the voiceprints of the user audio and the target audio are not matched.
Specifically, this embodiment can set corresponding weight values for the distribution position and the user voiceprint similarity, and judge whether the voiceprints match by weighting to obtain the comprehensive similarity score, so that the accuracy of voiceprint recognition is further improved. Each distribution position has a corresponding ranking score: the closer the distribution position is to the head, the higher the ranking score. The ranking score and the user voiceprint similarity can each be multiplied by the corresponding weight value, and their sum is used as the comprehensive similarity score.
The above scheme is illustrated by way of example:
setting the weight of the distribution position to be 0.6, setting the weight of the voiceprint similarity of the user to be 0.4, and judging the voiceprint matching when the comprehensive similarity score is greater than 0.8.
If the user voiceprint similarity between the user audio and the target audio is 0.6, the distribution position of the user voiceprint similarity is the top 1%, and the ranking score is 0.99, then the comprehensive similarity score is 0.99 × 0.6 + 0.6 × 0.4 = 0.834. Although the voiceprint similarity between the user audio and the target audio is low, since the vocal range and voiceprint characteristics of the target audio are rare in the crowd, the distribution position of the user audio in the crowd is high, and a voiceprint match can still be determined.
If the user voiceprint similarity between the user audio and the target audio is 0.9, the distribution position of the user voiceprint similarity is the top 50%, and the ranking score is 0.5, then the comprehensive similarity score is 0.5 × 0.6 + 0.9 × 0.4 = 0.66. Although the voiceprint similarity between the user audio and the target audio is high, because the vocal range and voiceprint characteristics of the target audio are common in the crowd, the distribution position of the user audio in the crowd is low, and at this time a voiceprint mismatch can be determined.
Therefore, the defect of low recognition accuracy caused by judging the voiceprint similarity only against a fixed threshold in the traditional scheme can be avoided, and the accuracy of voiceprint recognition is improved by comprehensively deciding whether voiceprints match based on both the voiceprint similarity and the distribution position.
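The two worked examples above can be reproduced with a short sketch; the function name is hypothetical, and the fixed weights 0.6/0.4 and the 0.8 decision threshold follow the example values in the text rather than mandatory settings:

```python
def composite_score(rank_score, user_similarity, w_pos=0.6, w_sim=0.4):
    # Weighted comprehensive similarity score: distribution-position
    # ranking score weighted 0.6, user voiceprint similarity weighted 0.4.
    return w_pos * rank_score + w_sim * user_similarity

# First worked example: low similarity (0.6) but top-1% position (0.99).
assert abs(composite_score(0.99, 0.6) - 0.834) < 1e-9  # 0.834 > 0.8, match
# Second worked example: high similarity (0.9) but top-50% position (0.5).
assert abs(composite_score(0.5, 0.9) - 0.66) < 1e-9    # 0.66 <= 0.8, no match
```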
In practical application, a database for storing songs often contains a large number of works whose singers have not registered on the platform, and singers frequently apply to claim the attribution of the corresponding works. In the related art, singer authentication is realized only by means of voiceprint similarity, but because the voiceprint similarity distribution is unbalanced in the crowd, it is difficult to perform singer authentication using a fixed threshold. In view of the above problems, the present application provides a method for authenticating a singer by a user, the method comprising the following steps:
Step 1: receiving a singer authentication request of a target user, determining a target authenticated singer corresponding to the singer authentication request, and querying the singer singing audio of the target authenticated singer;
the embodiment may be applied to a music server, and after receiving a singer authentication request uploaded by a terminal device, may determine that a target singer wants to authenticate the singer, that is, the target authenticated singer. In this embodiment, the singer singing audio of the target authenticated singer may be randomly extracted from the song library, or the representative of the target authenticated singer may be set as the singer singing audio for voiceprint similarity comparison.
Step 2: receiving user singing audio uploaded by the target user;
Step 3: determining the user voiceprint similarity between the singer singing audio and the user singing audio, and the reference voiceprint similarity between the singer singing audio and each of a plurality of reference singing audios;
In this embodiment, songs sung by other singers can be selected from the song library as the reference singing audio, songs uploaded by other users can also be selected as the reference singing audio, or songs sung by other singers and songs uploaded by other users can jointly be used as the reference singing audio.
Step 4: constructing a similarity distribution model according to the reference voiceprint similarity between the singer singing audio and each of the multiple reference singing audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model;
Step 5: judging, according to the distribution position, whether the user singing audio matches the singer singing audio by voiceprint; and if the voiceprints match, determining that the target user passes the singer authentication.
In this embodiment, after receiving the singer authentication request, the user voiceprint similarity between the user singing audio and the singer singing audio is determined, and the reference voiceprint similarity between each reference singing audio and the singer singing audio is also determined. Because vocal ranges and timbres vary greatly from person to person, different singers' singing audio have different voiceprint similarity distributions in the crowd. In this embodiment, a similarity distribution model is constructed according to the reference voiceprint similarity between the singer singing audio and each of the multiple reference singing audios, and whether the user singing audio matches the singer singing audio by voiceprint is determined according to the distribution position of the user voiceprint similarity in the similarity distribution model. Compared with the traditional scheme of evaluating the voiceprint similarity entirely against a fixed threshold, the present application uses the distribution position of the user voiceprint similarity among all reference voiceprint similarities to reflect the matching probability of the user singing audio and the singer singing audio, realizes the judgment of whether voiceprints match according to a dynamic standard, and improves the accuracy of voiceprint recognition.
Referring to fig. 7, fig. 7 is a flowchart of a singer authentication method provided in an embodiment of the present application, where the present embodiment is a scheme of applying the voiceprint recognition operation to a singer authentication scenario, and a further implementation manner can be obtained by combining the present embodiment with the above embodiment, where the present embodiment may include the following steps:
S701: receiving an authentication request of a user, and determining a target authenticated singer corresponding to the authentication request.
S702: determining the target audio according to the musical composition of the target authenticated singer in the database.
The target audio can be determined according to any musical composition of the target authenticated singer, and the selected musical composition can be a complete work or a segment of a work. As a feasible implementation manner, this embodiment may determine the music track corresponding to the target audio, query the database for the musical composition in which the target authenticated singer sings the music track, and determine the target audio according to the musical composition.
Furthermore, since the user audio uploaded by the user is a dry vocal, in order to improve the accuracy of voiceprint recognition, this embodiment can perform sound source separation on the musical composition of the target authenticated singer, and take the vocals obtained by sound source separation as the target audio, so as to realize voiceprint feature comparison based on dry vocals.
S703: and calculating the user voiceprint similarity of the user audio and the target audio.
S704: and determining a plurality of reference audios according to the musical works of N singers in the database, and calculating the reference voiceprint similarity of the target audio and each reference audio.
In this embodiment, N singers other than the target authentication singer may be randomly selected from the database, and the musical compositions of the N singers may be determined as the reference audio. The musical works of the N singers can be complete musical works or musical work segments.
S705: and constructing a similarity distribution model according to the similarity between the target audio and the reference voiceprint of each of the multiple reference audios, determining the distribution position of the user voiceprint similarity in the similarity distribution model, and judging whether the user audio is voiceprint matched with the target audio according to the distribution position.
By the above method, the distribution position of the user relative to the target authenticated singer in the crowd can be determined, and the closer the distribution position is to the head, the higher the probability that the identity of the user is the target authenticated singer.
The flow described in the above embodiment is explained below by an embodiment in practical use.
Referring to fig. 8, fig. 8 is a schematic product-side interaction diagram of a singer authentication method based on reference crowd probability according to an embodiment of the present application. This embodiment provides an accurate and efficient singer authentication scheme for the situation in which works in a song library have no registered singer to claim them, and offers a fast review mode for singer authentication. As shown in fig. 8, a user requesting authentication enters an authentication interface through a mobile phone terminal or a computer, first inputs the singer information to be authenticated; the system then returns a designated song for the user to sing, the user uploads the recorded voice to a background server, and the background server automatically verifies and matches the voiceprint features of the recorded voice against the voiceprint features of the to-be-authenticated singer's works in the song library.
Referring to fig. 9, fig. 9 is a flowchart of a singer authentication method based on reference crowd probability provided in an embodiment of the present application, where this embodiment describes an implementation manner in which a background server receives an utterance uploaded by a user and then determines whether the user is a singer to be authenticated, and this embodiment may include the following steps:
Step 1: dry vocal classification preprocessing.
After the user requesting authentication uploads a segment of dry vocal, it is necessary to judge whether the uploaded dry vocal meets the requirements. The requirements may include: the dry vocal is clear, with no obvious noise or other extraneous signals (speech, etc.). If a large amount of noise or silence exists in the dry vocal recording, or the effective duration is too short (for example, less than a 7 s threshold), the corresponding reason for authentication failure can be returned, and the user is reminded to record a recording that meets the requirements again.
Step 2: calculating the voiceprint features.
In this step, the voiceprint features of the uploaded dry vocal and of the works corresponding to the singer to be authenticated need to be calculated respectively. Specifically, when the uploaded dry vocal meets the requirements, the voiceprint feature X_vocal of the dry vocal is calculated by using a neural network model. Song works corresponding to the singer are searched in the song library according to the singer id to be authenticated uploaded by the user, and the voiceprint feature X_singer of the singer's song works is calculated by using the neural network model. Further, when calculating the voiceprint feature based on the singer's song, the voiceprint feature may be calculated after separating the accompaniment and extracting the vocals by using a sound source separation method, or the voiceprint feature may be calculated directly without performing sound source separation.
Step 3: returning an authentication result according to the similarity and probability distribution information of X_vocal and X_singer.
In this step, the cosine distance, the L2 distance, or another distance between the voiceprint features may be used to calculate the voiceprint similarity of X_vocal and X_singer. In the conventional scheme, if the voiceprint similarity is greater than a certain threshold, the authentication can be considered successful; otherwise, the authentication is considered failed. However, the applicable thresholds of different singers differ in actual business, and it is difficult to apply a common threshold to all singers. Therefore, this embodiment provides a voiceprint recognition scheme based on a reference population; please refer to fig. 10, which is a schematic diagram illustrating the principle of the voiceprint recognition algorithm based on a reference population according to this embodiment.
As shown in fig. 10, this embodiment may calculate the cosine similarity corr_A between the voiceprint feature A uploaded by the user and the voiceprint feature B of the singer to be authenticated. A sufficient number of singers (for example 1000 singers) C, D, E, … are randomly selected from the crowd, and the cosine similarities corr_C, corr_D, corr_E, … between their voiceprint features and the voiceprint feature of the singer to be authenticated are respectively calculated. Based on the similarity set corr_C, corr_D, corr_E, … of the reference population, a mean corr_MEAN and a variance corr_VAR are calculated, and based on these two, a Gaussian distribution function N(x; corr_MEAN, corr_VAR) is constructed. An upper cumulative distribution function value (UCD) of the cosine similarity corr_A of the currently requested dry vocal sample is then calculated in the Gaussian model of the reference population, where the numerical meaning of the UCD is the head proportion of the similarity of the currently requested dry vocal in the general population. For example, if the calculated upper cumulative distribution function value is 0.1, it means that the current similarity ranks in the top 10% of the population, i.e., 90% of the population has a lower similarity to the target singer than the currently requested sample of the user's voice.
In this embodiment, a reasonable threshold (e.g. 0.15) may be set for the upper cumulative distribution function value to determine whether the voiceprint feature A of the current dry vocal matches the voiceprint feature B of the singer to be authenticated; if so, it is determined that the singer authentication succeeds.
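Under the assumptions of this embodiment (a Gaussian model of the reference population and the example UCD threshold 0.15), the decision step can be sketched as follows; the identifier names mirror corr_MEAN and corr_VAR in the text, but the helper function itself is illustrative:

```python
import math

def authenticate(corr_a, reference_corrs, ucd_threshold=0.15):
    # Fit N(corr_MEAN, corr_VAR) to the reference-population similarities
    # and accept when the upper cumulative distribution value of corr_a
    # falls below the threshold (0.15 per the example in the text).
    n = len(reference_corrs)
    corr_mean = sum(reference_corrs) / n
    corr_var = sum((c - corr_mean) ** 2 for c in reference_corrs) / n
    ucd = 0.5 * math.erfc((corr_a - corr_mean) / math.sqrt(2.0 * corr_var))
    return ucd < ucd_threshold

# Illustrative values: a similarity far above the reference population
# passes, while one at the population mean (UCD = 0.5) does not.
refs = [0.15, 0.25, 0.2, 0.3, 0.1]
assert authenticate(0.85, refs) is True
assert authenticate(0.2, refs) is False
```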
With the scheme for singer authentication based on singing-voice timbre recognition provided by this embodiment, a user requesting authentication can be automatically verified simply by uploading a recorded singing sample. This embodiment adopts machine learning / pattern recognition technology to automatically perform authentication review, and proposes a scheme of referring to the crowd probability distribution to replace the traditional mode of setting an absolute threshold for recognition judgment. This embodiment can also replace the traditional cumbersome steps requiring manual review, greatly saving manpower, and can quickly return the authentication result, thereby increasing the attraction of the platform for singer authentication, expanding the number of authenticated singers in the song library, and improving the influence of the platform.
An embodiment of the present application further provides a voiceprint recognition apparatus, where the apparatus may include:
the audio determining module is used for receiving user audio and determining a target audio corresponding to the user audio;
the similarity calculation module is used for determining the user voiceprint similarity of the target audio and the user audio and the reference voiceprint similarity of the target audio and each reference audio in a plurality of reference audios;
the distribution position determining module is used for constructing a similarity distribution model according to the similarity between the target audio and the reference voiceprint of each reference audio in a plurality of reference audios and determining the distribution position of the user voiceprint similarity in the similarity distribution model;
and the matching decision module is used for judging whether the user audio is matched with the target audio in the voiceprint mode according to the distribution position.
After receiving the user audio, this embodiment determines the user voiceprint similarity between the user audio and the target audio, and also determines the reference voiceprint similarity between each reference audio and the target audio. Because vocal ranges and timbres vary greatly from person to person, different target audios have different voiceprint similarity distributions in the crowd. In this embodiment, probability distribution information of the voiceprint similarity between the user audio and the target audio among the reference audios is determined according to the user voiceprint similarity and the reference voiceprint similarity, and whether the voiceprint of the user audio matches the target audio is determined according to the probability distribution information. Compared with the traditional scheme of evaluating the voiceprint similarity entirely against a fixed threshold, the present application uses the probability distribution information of the voiceprint similarity of the user audio to reflect the matching probability of the user audio and the target audio, realizes the judgment of whether voiceprints match according to a dynamic standard, and improves the accuracy of voiceprint recognition.
The present application also provides a storage medium having a computer program stored thereon, which when executed, may implement the steps provided by the above-described embodiments. The storage medium may include: various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present application further provides an electronic device, and referring to fig. 11, a structure diagram of an electronic device provided in an embodiment of the present application may include a processor 1110 and a memory 1120, as shown in fig. 11.
The processor 1110 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 1110 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1110 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1110 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 1110 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 1120 may include one or more computer-readable storage media, which may be non-transitory. The memory 1120 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 1120 is at least used for storing a computer program 1121, wherein after being loaded and executed by the processor 1110, the computer program can implement the relevant steps in the voiceprint recognition method and/or singer authentication method disclosed in any of the foregoing embodiments. In addition, the resources stored by the memory 1120 may include, among other things, an operating system 1122 and data 1123, which may be stored in a transitory or persistent manner. The operating system 1122 may include Windows, Linux, Android, and the like.
In some embodiments, the electronic device may also include a display screen 1130, input/output interface 1140, communication interface 1150, sensors 1160, power supply 1170, and communication bus 1180.
Of course, the structure of the electronic device shown in fig. 11 does not constitute a limitation of the electronic device in the embodiment of the present application, and the electronic device may include more or less components than those shown in fig. 11 or some components in combination in practical applications.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (14)

1. A voiceprint recognition method, comprising:
receiving user audio and determining a target audio corresponding to the user audio;
determining user voiceprint similarity of the target audio and the user audio, and reference voiceprint similarity of the target audio and each reference audio in a plurality of reference audios;
constructing a similarity distribution model according to the reference voiceprint similarity between the target audio and each of the multiple reference audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model;
and judging whether the user audio is matched with the target audio in a voiceprint mode or not according to the distribution position.
2. The voiceprint recognition method according to claim 1, wherein constructing a similarity distribution model according to the similarity between the target audio and the reference voiceprint of each of the plurality of reference audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model comprises:
constructing a Gaussian distribution function according to the mean value and the variance of the similarity of the target audio and the reference voiceprint of each of the multiple reference audios;
and calculating an upper cumulative distribution function value of the user voiceprint similarity in the Gaussian distribution function, and determining the distribution position of the user voiceprint similarity in the Gaussian distribution function according to the upper cumulative distribution function value.
3. The voiceprint recognition method according to claim 1, wherein determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of a plurality of reference audios respectively comprises:
determining a user voiceprint feature vector of the user audio, a target voiceprint feature vector of the target audio, and a reference voiceprint feature vector of each of a plurality of reference audios;
calculating the user voiceprint similarity according to the user voiceprint feature vector and the target voiceprint feature vector;
and calculating the reference voiceprint similarity according to the target voiceprint feature vector and the reference voiceprint feature vector.
4. The voiceprint recognition method according to claim 3, wherein calculating the user voiceprint similarity according to the user voiceprint feature vector and the target voiceprint feature vector comprises:
calculating the user voiceprint similarity according to the cosine distance between the user voiceprint feature vector and the target voiceprint feature vector;
correspondingly, calculating the reference voiceprint similarity according to the target voiceprint feature vector and the reference voiceprint feature vector, including:
and calculating the similarity of the reference voiceprint according to the cosine distance between the user voiceprint characteristic vector and the reference voiceprint characteristic vector.
5. The voiceprint recognition method according to claim 1, further comprising, after receiving the user audio:
judging whether the user audio meets a preset condition or not; the preset condition comprises any one or a combination of any several of a definition constraint condition, a duration constraint condition and an audio type constraint condition;
if yes, executing the operation of determining the target audio corresponding to the user audio;
if not, returning the prompt information of audio input failure and re-receiving the user audio.
6. The voiceprint recognition method according to claim 1, wherein determining the target audio corresponding to the user audio comprises:
receiving an authentication request of a user, and determining a target authentication singer corresponding to the authentication request;
and determining the target audio according to the musical composition of the target authentication singer in the database.
7. The voiceprint recognition method of claim 6 wherein determining the target audio from the musical composition of the target authenticated singer in the database comprises:
determining a music track corresponding to the target audio;
and inquiring the music composition of the target authenticated singer singing the music composition from the database, and determining the target audio frequency according to the music composition.
8. The voiceprint recognition method of claim 6, wherein determining the target audio according to the musical composition of the target authentication singer in the database comprises:
and carrying out sound source separation on the musical composition of the target authentication singer in the database, and taking the vocals obtained by the sound source separation as the target audio.
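As a very rough stand-in for the sound source separation of claim 8, the sketch below extracts the mid (center) channel of a stereo recording, where lead vocals are usually panned. A production system would instead run a trained separation model; this heuristic is purely illustrative and not what the patent specifies.

```python
import numpy as np

def crude_vocal_estimate(stereo: np.ndarray) -> np.ndarray:
    """Crude center-channel vocal estimate from a stereo signal.

    stereo: array of shape (n_samples, 2). Lead vocals are commonly mixed
    to the center, so the mid signal (L + R) / 2 emphasizes them, while
    the side signal (L - R) / 2 captures wide-panned accompaniment.
    """
    left, right = stereo[:, 0], stereo[:, 1]
    mid = (left + right) / 2.0
    return mid
```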
9. The voiceprint recognition method according to claim 6, wherein determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of a plurality of reference audios, respectively, comprises:
calculating the user voiceprint similarity of the user audio and the target audio;
and determining a plurality of reference audios according to the musical compositions of N singers in the database, and calculating the reference voiceprint similarity of the target audio and each reference audio.
10. The voiceprint recognition method according to any one of claims 1 to 9, wherein determining whether the user audio and the target audio are voiceprint matched according to the distribution position comprises:
judging whether the distribution position is within a preset position interval or not;
if yes, judging that the voiceprint of the user audio is matched with the voiceprint of the target audio;
if not, judging that the voiceprints of the user audio and the target audio are not matched.
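The interval test of claim 10 can be sketched as below. Here the "distribution position" is expressed as a z-score against a Gaussian model fitted to the reference (impostor) similarities; the Gaussian choice and the interval bound of z ≥ 2 are illustrative assumptions, not values fixed by the claim.

```python
import math

def distribution_position(user_sim: float, reference_sims: list) -> float:
    """Position of the user similarity within a Gaussian model fitted to
    the reference voiceprint similarities, expressed as a z-score."""
    n = len(reference_sims)
    mean = sum(reference_sims) / n
    var = sum((s - mean) ** 2 for s in reference_sims) / n
    std = math.sqrt(var) or 1e-12   # guard against zero variance
    return (user_sim - mean) / std

def voiceprint_matches(user_sim: float, reference_sims: list,
                       lo: float = 2.0, hi: float = float("inf")) -> bool:
    """Match only if the distribution position falls in the preset interval.

    lo/hi define the "preset position interval"; z >= 2 (well above the
    impostor distribution) is an assumed bound for illustration.
    """
    pos = distribution_position(user_sim, reference_sims)
    return lo <= pos <= hi
```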
11. The voiceprint recognition method according to any one of claims 1 to 9, wherein determining whether the user audio and the target audio are voiceprint matched according to the distribution position comprises:
carrying out a weighted calculation on the distribution position and the user voiceprint similarity to obtain a comprehensive similarity score;
judging whether the comprehensive similarity score is greater than a preset score;
if yes, judging that the voiceprint of the user audio is matched with the voiceprint of the target audio;
if not, judging that the voiceprints of the user audio and the target audio are not matched.
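The weighted-score variant of claim 11 can be sketched as follows. To make the two quantities commensurable, the distribution position is expressed here as a percentile-style fraction in [0, 1]; the weights and the "preset score" threshold are illustrative assumptions only.

```python
def percentile_position(user_sim: float, reference_sims: list) -> float:
    """Fraction of reference similarities that the user similarity exceeds:
    one simple way to express a 'distribution position' on a [0, 1] scale."""
    below = sum(1 for s in reference_sims if s < user_sim)
    return below / len(reference_sims)

def comprehensive_score(position: float, user_sim: float,
                        w_pos: float = 0.5, w_sim: float = 0.5) -> float:
    """Weighted combination of distribution position and raw similarity.
    The claim only requires some weighted calculation of the two."""
    return w_pos * position + w_sim * user_sim

def matches_by_score(position: float, user_sim: float,
                     threshold: float = 0.75) -> bool:
    # threshold stands in for the claimed "preset score" (assumed value)
    return comprehensive_score(position, user_sim) > threshold
```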
12. A singer authentication method, comprising:
receiving a singer authentication request of a target user, determining a target authentication singer corresponding to the singer authentication request, and querying singer singing audio of the target authentication singer;
receiving user singing audio uploaded by the target user;
determining the user voiceprint similarity between the singer singing audio and the user singing audio, and the reference voiceprint similarity between the singer singing audio and each of a plurality of reference singing audios;
constructing a similarity distribution model according to the reference voiceprint similarity between the singer singing audio and each of the plurality of reference singing audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model;
judging, according to the distribution position, whether the user singing audio and the singer singing audio are voiceprint-matched; and if the voiceprints are matched, judging that the target user passes the singer authentication.
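The end-to-end authentication flow of claim 12 can be sketched as below. The embedding extractor is a placeholder stub (a real system would run a trained speaker-embedding model), and the z-score decision rule and its threshold are illustrative assumptions layered on the claimed steps.

```python
import numpy as np

def extract_voiceprint(audio: np.ndarray) -> np.ndarray:
    """Placeholder voiceprint extractor: a real system would compute a
    speaker embedding here. This unit-normalizing stub is illustrative only."""
    return audio / (np.linalg.norm(audio) + 1e-12)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def authenticate_singer(singer_audio, user_audio, reference_audios,
                        z_threshold: float = 2.0) -> bool:
    """Compare the user's singing against the claimed singer's audio,
    position that similarity within the reference (impostor) similarity
    distribution, and pass authentication only if it sits far enough
    into the genuine region. z_threshold is an assumed value."""
    singer_vp = extract_voiceprint(singer_audio)
    user_sim = cosine(singer_vp, extract_voiceprint(user_audio))
    ref_sims = [cosine(singer_vp, extract_voiceprint(r)) for r in reference_audios]
    mean = float(np.mean(ref_sims))
    std = float(np.std(ref_sims)) or 1e-12
    position = (user_sim - mean) / std       # distribution position (z-score)
    return position >= z_threshold           # voiceprint match -> authentication passes
```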
13. An electronic device, comprising a memory in which a computer program is stored, and a processor which, when invoking the computer program from the memory, implements the steps of the method according to any one of claims 1 to 12.
14. A storage medium having computer-executable instructions stored thereon which, when loaded and executed by a processor, implement the steps of the method according to any one of claims 1 to 12.
CN202180001166.3A 2021-05-08 2021-05-08 Voiceprint identification method, singer authentication method, electronic equipment and storage medium Pending CN113366567A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/092291 WO2022236453A1 (en) 2021-05-08 2021-05-08 Voiceprint recognition method, singer authentication method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113366567A 2021-09-07

Family

ID=77523042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180001166.3A Pending CN113366567A (en) 2021-05-08 2021-05-08 Voiceprint identification method, singer authentication method, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113366567A (en)
WO (1) WO2022236453A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392986A (en) * 2023-12-11 2024-01-12 杭州网易云音乐科技有限公司 Voiceprint processing method, voiceprint processing apparatus, voiceprint processing device, voiceprint processing program product, and storage medium
CN117392986B (en) * 2023-12-11 2024-05-14 杭州网易云音乐科技有限公司 Voiceprint processing method, voiceprint processing apparatus, voiceprint processing device, voiceprint processing program product, and storage medium

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1808567A (en) * 2006-01-26 2006-07-26 覃文华 Voice-print authentication device and method of authenticating people presence
CN102404278A (en) * 2010-09-08 2012-04-04 盛乐信息技术(上海)有限公司 Song request system based on voiceprint recognition and application method thereof
US20140348308A1 (en) * 2013-05-22 2014-11-27 Nuance Communications, Inc. Method And System For Speaker Verification
CN105989842A (en) * 2015-01-30 2016-10-05 福建星网视易信息系统有限公司 Method and device for voiceprint similarity comparison and application thereof in digital entertainment on-demand system
US20170118205A1 (en) * 2014-03-12 2017-04-27 Beijing Techshino Technology Co., Ltd. User biological feature authentication method and system
US20180277122A1 (en) * 2015-12-30 2018-09-27 Baidu Online Network Technology (Beijing) Co., Ltd. Artificial intelligence-based method and device for voiceprint authentication
CN109243465A (en) * 2018-12-06 2019-01-18 平安科技(深圳)有限公司 Voiceprint authentication method, device, computer equipment and storage medium
CN109257362A (en) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of voice print verification
CN109684454A (en) * 2018-12-26 2019-04-26 北京壹捌零数字技术有限公司 A kind of social network user influence power calculation method and device
CN110010159A (en) * 2019-04-02 2019-07-12 广州酷狗计算机科技有限公司 Sound similarity determines method and device
CN111199729A (en) * 2018-11-19 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint recognition method and device
US10665244B1 (en) * 2018-03-22 2020-05-26 Pindrop Security, Inc. Leveraging multiple audio channels for authentication
US20200227049A1 (en) * 2019-01-11 2020-07-16 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and device for waking up voice interaction device, and storage medium
CN111444377A (en) * 2020-04-15 2020-07-24 厦门快商通科技股份有限公司 Voiceprint identification authentication method, device and equipment
CN111554303A (en) * 2020-05-09 2020-08-18 福建星网视易信息系统有限公司 User identity recognition method and storage medium in song singing process
CN112231510A (en) * 2020-12-17 2021-01-15 北京远鉴信息技术有限公司 Voiceprint storage method, voiceprint query method, server and storage medium
CN112331217A (en) * 2020-11-02 2021-02-05 泰康保险集团股份有限公司 Voiceprint recognition method and device, storage medium and electronic equipment
CN112614478A (en) * 2020-11-24 2021-04-06 北京百度网讯科技有限公司 Audio training data processing method, device, equipment and storage medium
US20210125619A1 (en) * 2018-07-06 2021-04-29 Veridas Digital Authentication Solutions, S.L. Authenticating a user


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU, XIAOQIAN et al.: "Research on Real-time Voiceprint Recognition Based on VQ and GMM", Computer Systems & Applications (《计算机系统应用》), vol. 23, no. 9, 31 December 2014 (2014-12-31) *


Also Published As

Publication number Publication date
WO2022236453A1 (en) 2022-11-17

Similar Documents

Publication Publication Date Title
CN107767869B (en) Method and apparatus for providing voice service
US20180197548A1 (en) System and method for diarization of speech, automated generation of transcripts, and automatic information extraction
CN101014953A (en) Audio fingerprinting system and method
CN105679324A (en) Voiceprint identification similarity scoring method and apparatus
CN106302987A (en) A kind of audio frequency recommends method and apparatus
Yoshimura et al. A hierarchical predictor of synthetic speech naturalness using neural networks
Pandit et al. Feature selection for a DTW-based speaker verification system
Niyazov et al. Content-based music recommendation system
CN115394318A (en) Audio detection method and device
CN114491140A (en) Audio matching detection method and device, electronic equipment and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
CN111028847B (en) Voiceprint recognition optimization method based on back-end model and related device
CN112632318A (en) Audio recommendation method, device and system and storage medium
CN106663110B (en) Derivation of probability scores for audio sequence alignment
JP6996627B2 (en) Information processing equipment, control methods, and programs
CN110782877A (en) Speech identification method and system based on Fisher mixed feature and neural network
Mallikarjunan et al. Text-independent speaker recognition in clean and noisy backgrounds using modified VQ-LBG algorithm
CN113366567A (en) Voiceprint identification method, singer authentication method, electronic equipment and storage medium
CN115083397A (en) Training method of lyric acoustic model, lyric recognition method, equipment and product
CN110489588B (en) Audio detection method, device, server and storage medium
CN111081261B (en) Text-independent voiceprint recognition method based on LDA
Li et al. Model Compression for DNN-based Speaker Verification Using Weight Quantization
WO2006027844A1 (en) Speaker collator
CN113744721B (en) Model training method, audio processing method, device and readable storage medium
CN114783417B (en) Voice detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination