WO2022236453A1 - Voiceprint recognition method, singer authentication method, electronic device and storage medium - Google Patents

Voiceprint recognition method, singer authentication method, electronic device and storage medium

Info

Publication number
WO2022236453A1
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
audio
user
similarity
target
Prior art date
Application number
PCT/CN2021/092291
Other languages
English (en)
French (fr)
Inventor
胡诗超
陈灏
Original Assignee
腾讯音乐娱乐科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯音乐娱乐科技(深圳)有限公司
Priority to CN202180001166.3A
Priority to PCT/CN2021/092291
Publication of WO2022236453A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/63 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/18 Artificial neural networks; Connectionist approaches

Definitions

  • the present application relates to the technical field of biometrics, and in particular to a voiceprint recognition method, a singer authentication method, an electronic device and a storage medium.
  • a voiceprint is a sound wave spectrum that carries speech information displayed by an electroacoustic instrument. Through voiceprint recognition, it can be judged whether multiple audio inputters are the same person. Today, voiceprint recognition has been widely used in various scenarios such as device unlocking, bank transactions, and singer authentication.
  • In the voiceprint recognition process, the actually collected user audio and the target audio to be compared with it are determined, the voiceprint feature similarity is obtained by comparing the voiceprint features of the user audio with those of the target audio, and a fixed threshold is used to judge whether the voiceprints match.
  • However, due to the uneven distribution of voiceprint similarity in the crowd, it is difficult to use a fixed threshold to evaluate whether the voiceprints match.
  • the purpose of this application is to provide a voiceprint recognition method, a singer authentication method, an electronic device and a storage medium, which can improve the accuracy of voiceprint recognition.
  • the present application provides a voiceprint recognition method, the voiceprint recognition method comprising:
  • determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of the multiple reference audios respectively includes:
  • calculating the user voiceprint similarity according to the user voiceprint feature vector and the target voiceprint feature vector includes:
  • calculating the similarity of the reference voiceprint according to the target voiceprint feature vector and the reference voiceprint feature vector includes:
  • the reference voiceprint similarity is calculated according to the cosine distance between the target voiceprint feature vector and the reference voiceprint feature vector.
  • the preset condition includes any one or a combination of clarity constraints, duration constraints and audio type constraints;
  • determining the target audio corresponding to the user audio includes:
  • the target audio is determined according to the musical composition of the target certified singer in the database.
  • determining the target audio according to the music works of the target certified singer in the database includes:
  • determining the target audio according to the music works of the target certified singer in the database includes:
  • Sound source separation is performed on the music works of the target certified singer in the database, and the human voice obtained by sound source separation is used as the target audio.
  • determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of the multiple reference audios respectively includes:
  • a plurality of reference audios are determined according to music works of N singers in the database, and a reference voiceprint similarity between the target audio and each of the reference audios is calculated.
  • judging whether the voiceprint of the user audio matches the target audio according to the distribution location includes:
  • judging whether the voiceprint of the user audio matches the target audio according to the distribution location includes:
  • This application also provides a singer authentication method, including:
  • According to the distribution position, it is judged whether the voiceprint of the user's singing audio matches the singer's singing audio; if the voiceprint matches, it is determined that the target user has passed the singer authentication.
  • the present application also provides a storage medium on which a computer program is stored, and when the computer program is executed, the steps performed by the above-mentioned voiceprint recognition method are realized.
  • the present application also provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and when the processor invokes the computer program in the memory, the steps performed by the above voiceprint recognition method are realized.
  • the present application provides a voiceprint recognition method, including: receiving user audio and determining the target audio corresponding to the user audio; determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of a plurality of reference audios; constructing a similarity distribution model according to the reference voiceprint similarities between the target audio and the reference audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model; and judging, according to the distribution position, whether the voiceprint of the user audio matches the target audio.
  • After receiving the user audio, the present application determines the user voiceprint similarity between the user audio and the target audio, and also determines the reference voiceprint similarity between each reference audio and the target audio. Since the vocal range and timbre vary greatly from person to person, different target audios have different voiceprint similarity distributions in the crowd.
  • the present application constructs a similarity distribution model based on the similarity between the target audio and the reference voiceprints of each of the multiple reference audios, and judges the user's voiceprint similarity according to the distribution position of the similarity in the similarity distribution model. Whether the voiceprint of the user audio matches the target audio.
  • this application uses the distribution position of the user voiceprint similarity among all reference voiceprint similarities to reflect the matching probability between the user audio and the target audio, so that a dynamic standard is used to judge whether the voiceprints match, which improves the accuracy of voiceprint recognition.
  • the present application also provides a singer authentication method, an electronic device and a storage medium, which have the above-mentioned beneficial effects, and will not be repeated here.
  • FIG. 1 is a structural diagram of a voiceprint recognition system provided by an embodiment of the present application
  • FIG. 2 is a flow chart of a voiceprint recognition method provided in an embodiment of the present application
  • FIG. 3 is a flow chart of a method for determining the distribution position of user voiceprint similarity provided by an embodiment of the present application
  • Fig. 4 is a Gaussian distribution model of reference audio provided by an embodiment of the present application;
  • FIG. 5 is a flow chart of a method for determining matching similarity information provided in an embodiment of the present application
  • FIG. 6 is a flowchart of an audio preprocessing method provided by an embodiment of the present application.
  • Fig. 7 is a flow chart of a singer authentication method provided by the embodiment of the present application.
  • Fig. 8 is a product-side interactive schematic diagram of a singer authentication method based on the probability of a reference crowd provided by the embodiment of the application;
  • FIG. 9 is a flow chart of a singer authentication method based on the probability of a reference crowd provided by the embodiment of the present application.
  • FIG. 10 is a schematic diagram of the principle of a voiceprint recognition algorithm based on a reference group provided in the embodiment of the present application.
  • FIG. 11 is a structural diagram of an electronic device provided by an embodiment of the present application.
  • In the voiceprint recognition process, the actually collected user audio and the target audio to be compared with it are determined, the voiceprint feature similarity is obtained by comparing the voiceprint features of the user audio with those of the target audio, and a fixed threshold is used to judge whether the voiceprints match.
  • For example, if the vocal range and timbre of singer A are similar to those of most people in the crowd, the voiceprint feature similarity must reach 90% before a voiceprint match can be determined; as another example, if the vocal range and timbre of singer B are different from those of most people in the crowd, a voiceprint match can be judged when the voiceprint feature similarity reaches 70%. It can be seen that for different target audios, there are different standards for measuring whether the voiceprint matches.
  • the present application provides the following implementations, which can achieve the effect of improving the accuracy of voiceprint recognition.
  • FIG. 1 is an architecture diagram of a voiceprint recognition system provided by an embodiment of the present application.
  • the voiceprint recognition system includes a client 101, a computing device 102, and a database 103. A user can transmit user audio to the computing device 102 through the client 101. After receiving the user audio, the computing device 102 sends an audio acquisition request to the database 103 to obtain the target audio to be compared with the user audio and the reference audios used to evaluate the similarity between the user audio and the target voiceprint.
  • the computing device can calculate the ranking probability of the user audio and the target audio in the crowd, and judge whether the voiceprints of the user audio and the target audio match based on the ranking probability.
  • FIG. 2 is a flow chart of a voiceprint recognition method provided in an embodiment of the present application. The specific steps may include:
  • S201 Receive user audio, and determine a target audio corresponding to the user audio
  • this embodiment can be applied to electronic devices such as smart phones, personal computers or servers.
  • the electronic device can be provided with a voice input module and use it to receive user audio input by the user in real time. The electronic device can also connect to other devices and receive user audio transmitted by them through wired or wireless methods.
  • the user audio is the audio of the user whose voiceprint needs to be recognized
  • the target audio is the audio whose voiceprint features need to be compared with the user's audio.
  • the target audio can be set according to the application scenario of this embodiment. For example, in a bank transaction, the user audio is the voice content of the trader, and the target audio is the voice content of the account creator when the bank account was created; in singer authentication, the user audio is the voice content of the authentication requester, and the target audio is the song content of the singer whose authentication is requested.
  • this embodiment may also include an operation of acquiring a user authentication request, and determine the target audio corresponding to the user audio by analyzing the user authentication request.
  • this embodiment can also determine the target audio according to the content of the user audio; for example, the song being sung can be identified from the user audio, and the target audio corresponding to the user audio can be determined according to that song.
  • S202 Determine the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of the multiple reference audios;
  • In this step, there may also be an operation of randomly obtaining reference audios from the database, and a reference audio may be any audio different from the target audio.
  • this embodiment can limit the number of reference audios to not less than a preset number, so as to calculate the reference voiceprint similarity between each reference audio and the target audio.
  • S203 Construct a similarity distribution model according to the reference voiceprint similarity between the target audio and each of the multiple reference audios, and determine the distribution position of the user voiceprint similarity in the similarity distribution model;
  • S204 Determine whether the voiceprint of the user audio matches the target audio according to the distribution position.
  • the reference voiceprint similarity between the reference audio and the target audio reflects the probability that the voiceprints of other people in the crowd are similar to the inputter of the target audio.
  • the distribution position of the user's voiceprint similarity in the similarity distribution model reflects the ranking probability of the user's voiceprint similarity among the crowd.
  • a similarity distribution model (such as a Gaussian model) can be established from the reference voiceprint similarity values of the crowd, and the ranking probability is determined according to the distribution position of the user voiceprint similarity in the similarity distribution model; the higher the ranking, the higher the probability that the user audio matches the target audio voiceprint.
  • the distribution position of the user voiceprint similarity in the similarity distribution model can be reflected by the upper cumulative distribution (UCD).
  • According to the upper cumulative distribution, the voiceprint similarity ranking of the user audio relative to the target audio can be determined; the higher the ranking, the greater the probability that the user audio matches the target audio voiceprint.
  • this embodiment can determine whether the distribution position is within a preset position interval; if it is, it can be determined that the voiceprint of the user audio matches the target audio, that is, the target audio and the user audio were input by the same user; if it is not, it can be determined that the voiceprint of the user audio does not match the target audio, that is, the target audio and the user audio were input by different users.
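  • The interval check above can be sketched as follows; the UCD value and the top-5% position interval are illustrative assumptions rather than values fixed by the application:

```python
def is_voiceprint_match(ucd_value: float, top_ratio: float = 0.05) -> bool:
    """Judge a match when the user's similarity ranks inside the preset
    position interval, e.g. within the top 5% of the reference crowd.

    ucd_value: upper-cumulative-distribution value of the user similarity,
    i.e. the fraction of the reference crowd scoring at least as high.
    """
    return ucd_value <= top_ratio

# A user ranking in the top 1% of the crowd matches under the assumed 5% interval:
print(is_voiceprint_match(0.01))  # True
print(is_voiceprint_match(0.40))  # False
```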
  • the user voiceprint similarity between the user audio and the target audio is determined, and the reference voiceprint similarity between the reference audio and the target audio is also determined. Since there are large differences in the range and timbre of each person in the crowd, there are different voiceprint similarity distributions in the crowd for different target audio.
  • a similarity distribution model is constructed based on the similarity between the target audio and the reference voiceprints of each of the multiple reference audios, and the judgment is made based on the distribution position of the user's voiceprint similarity in the similarity distribution model Whether the voiceprint of the user audio matches the target audio.
  • this embodiment uses the distribution position of the similarity of the user's voiceprint in all reference voiceprint similarities to reflect the matching probability between the user's audio and the target audio, and realizes The dynamic standard is used to judge whether the voiceprint matches, which improves the accuracy of voiceprint recognition.
  • FIG. 3 is a flow chart of a method for determining the distribution position of user voiceprint similarity provided by the embodiment of the present application.
  • This embodiment is a further introduction to S203 in the embodiment corresponding to FIG. 2, and can be combined with that embodiment to obtain a further implementation. This embodiment may include the following steps:
  • S301 Construct a Gaussian distribution function according to the mean and variance of the reference voiceprint similarity between the target audio and each of the multiple reference audios;
  • a similarity set may be constructed according to the similarities of all reference voiceprints, a mean value and a variance of the similarity set may be determined, and a Gaussian distribution function may be constructed based on the mean value and variance.
  • S302 Calculate the upper cumulative distribution function value of the user voiceprint similarity in the Gaussian distribution function
  • S303 Determine the distribution position of the user voiceprint similarity in the Gaussian distribution function according to the value of the upper cumulative distribution function
  • this embodiment calculates the upper cumulative distribution function value of the user voiceprint similarity in the Gaussian distribution function; this value describes where the voiceprint similarity between the user audio and the target audio ranks among all reference audios. That is, the upper cumulative distribution function value reflects the proportion of reference audios whose voiceprint similarity with the target audio exceeds that of the user audio; the smaller this value, the closer the user audio ranks to the head of the crowd.
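  • Steps S301-S303 can be sketched in Python; the reference similarity values below are illustrative, and the Gaussian survival function is computed with the complementary error function rather than a statistics library:

```python
import math

def upper_cumulative_distribution(user_sim, reference_sims):
    """Fit a Gaussian to the reference voiceprint similarities (S301) and
    return the upper cumulative distribution value of the user similarity,
    i.e. P(X >= user_sim) under that Gaussian (S302-S303)."""
    n = len(reference_sims)
    mean = sum(reference_sims) / n
    var = sum((s - mean) ** 2 for s in reference_sims) / n
    std = math.sqrt(var)
    # Survival function of N(mean, std^2) via the complementary error function.
    return 0.5 * math.erfc((user_sim - mean) / (std * math.sqrt(2)))

# Illustrative reference crowd centred around 0.4; a user similarity of 0.4
# sits exactly at the mean, so half the fitted distribution scores higher.
sims = [0.30, 0.35, 0.40, 0.45, 0.50]
print(round(upper_cumulative_distribution(0.40, sims), 2))  # 0.5
```

A user similarity far above the crowd mean yields a UCD value near zero, i.e. a position near the head of the distribution (point P in Fig. 4).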
  • Fig. 4 is a kind of reference audio Gaussian distribution model provided by the embodiment of the present application
  • the P point in Fig. 4 is the position of the voiceprint similarity of user audio and target audio in the Gaussian distribution function
  • the dotted line area is the upper cumulative distribution function value of the user voiceprint similarity in the Gaussian distribution function.
  • the Y axis represents the probability density of the random variable x, and the X axis represents the random variable.
  • FIG. 5 is a flow chart of a method for determining matching similarity information provided by the embodiment of the present application.
  • This embodiment is a further introduction to S202 in the embodiment corresponding to FIG. 2.
  • This embodiment can be combined with the embodiment corresponding to FIG. 2 to obtain a further implementation, and may include the following steps:
  • S501 Determine the user voiceprint feature vector of the user audio, the target voiceprint feature vector of the target audio, and the reference voiceprint feature vector of each of the multiple reference audios;
  • S502 Calculate the user voiceprint similarity according to the user voiceprint feature vector and the target voiceprint feature vector
  • S503 Calculate the reference voiceprint similarity according to the target voiceprint feature vector and the reference voiceprint feature vector.
  • the voiceprint feature vector of an audio can be determined in various ways; for example, the user voiceprint feature vector, the target voiceprint feature vector and the reference voiceprint feature vectors can be calculated based on neural network embeddings, or based on statistical signal processing methods such as i-vector.
  • the voiceprint similarity may be determined according to the distance between the user voiceprint feature vector, the target voiceprint feature vector, and the reference voiceprint feature vector.
  • this embodiment can calculate the user voiceprint similarity according to the cosine distance between the user voiceprint feature vector and the target voiceprint feature vector, and can calculate the reference voiceprint similarity according to the cosine distance between the target voiceprint feature vector and the reference voiceprint feature vector.
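  • A minimal sketch of the cosine-based similarity, assuming the voiceprint feature vectors have already been extracted (the example vectors are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint feature vectors:
    dot(a, b) / (|a| * |b|), in [-1, 1]; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

user_vec = [0.2, 0.8, 0.1]       # hypothetical user voiceprint embedding
target_vec = [0.25, 0.75, 0.15]  # hypothetical target voiceprint embedding
print(round(cosine_similarity(user_vec, target_vec), 3))
```

Real voiceprint embeddings are typically high-dimensional (hundreds of components), but the computation is identical.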
  • the accuracy of the user's voiceprint similarity and the reference voiceprint similarity can be improved through the above method, thereby realizing high-precision voiceprint recognition.
  • Fig. 6 is a flow chart of an audio preprocessing method provided by the embodiment of the present application.
  • This embodiment is a further supplementary introduction to the corresponding embodiment in Fig. 2 after receiving user audio.
  • This embodiment can be combined with the embodiment corresponding to FIG. 2 to obtain a further implementation, and may include the following steps:
  • Step S601 Determine whether the user audio meets the preset conditions; if yes, go to step S602; if not, go to step S603;
  • the preset conditions include any one or a combination of clarity constraints, duration constraints, and audio type constraints. Specifically, if there is no obvious noise or other irrelevant signal in the user audio, it can be determined that the user audio meets the clarity constraint; if the duration of the user audio is within the preset duration interval, it can be determined that the user audio meets the duration constraint; and if the user audio is a dry (unaccompanied) vocal recording, it can be determined that the user audio meets the audio type constraint.
  • S602 Perform an operation of determining a target audio corresponding to the user audio
  • After the target audio corresponding to the user audio is determined, the related operations of S201-S204 can be performed. S603: Do not execute the operation of determining the target audio, and return a prompt message of audio recording failure to prompt the user to re-record the audio.
  • Invalid audio can be filtered out through the above-mentioned audio preprocessing operation, and the power consumption of the voiceprint recognition device can be reduced.
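  • The duration-constraint part of this preprocessing gate can be sketched as follows; the 7 s effective-duration threshold follows the later embodiment, while the amplitude-based silence test and the sample values are illustrative assumptions:

```python
def passes_preprocessing(samples, sample_rate, min_seconds=7.0,
                         silence_threshold=0.01):
    """Sketch of the preprocessing gate: reject recordings whose effective
    (non-silent) duration is too short.  Samples are normalized amplitudes;
    the silence threshold is an illustrative assumption."""
    voiced = [s for s in samples if abs(s) > silence_threshold]
    effective_seconds = len(voiced) / sample_rate
    return effective_seconds >= min_seconds

# 10 s of voiced signal passes the 7 s constraint; 3 s does not.
rate = 16000
print(passes_preprocessing([0.5] * (10 * rate), rate))  # True
print(passes_preprocessing([0.5] * (3 * rate), rate))   # False
```

A production system would also check clarity (noise level) and audio type (dry vocal vs. accompanied), as the embodiment describes.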
  • a corresponding weight value can be set for the distribution position and the similarity of the user's voiceprint, and the weighted calculation of the comprehensive score of the similarity can be used to determine whether the voiceprint matches, which further improves the accuracy of voiceprint recognition.
  • each distribution position has its corresponding ranking score. The higher the distribution position is, the higher the ranking score is.
  • the ranking score and the user voiceprint similarity can each be multiplied by a corresponding weight value, and the sum of the two used as a composite similarity score.
  • For example, if the user voiceprint similarity between the user audio and the target audio is 0.6 and the distribution position of the user voiceprint similarity is in the top 1%, the ranking score is 0.99; if the distribution position of the user voiceprint similarity is in the top 50%, the ranking score is 0.5.
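  • The weighted combination can be sketched as below; the equal 0.5/0.5 weights are an illustrative assumption, since the application does not fix specific weight values:

```python
def composite_score(user_similarity, ranking_score,
                    w_similarity=0.5, w_ranking=0.5):
    """Weighted combination of voiceprint similarity and ranking score.
    The 0.5/0.5 weights are illustrative, not specified by the application."""
    return w_similarity * user_similarity + w_ranking * ranking_score

# Similarity 0.6 with a top-1% ranking (score 0.99) vs. a top-50% ranking (0.5):
print(round(composite_score(0.6, 0.99), 3))  # 0.795
print(round(composite_score(0.6, 0.5), 3))   # 0.55
```

The final match decision would compare this composite score against a threshold, so both the raw similarity and the crowd-relative position contribute.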
  • the above method avoids the low recognition accuracy caused by using only a fixed threshold to judge voiceprint similarity in the traditional solution. This embodiment makes a comprehensive decision on whether the voiceprint matches based on both the voiceprint similarity and the distribution position, improving the accuracy of voiceprint recognition.
  • this application provides a singer authentication method, which includes the following steps:
  • Step 1 Receive the singer authentication request of the target user, determine the target authentication singer corresponding to the singer authentication request, and query the singer singing audio of the target authentication singer;
  • this embodiment can be applied to a music server; after receiving the singer authentication request uploaded by a terminal device, the singer that the target user wants to be authenticated as, that is, the target certified singer, can be determined.
  • the singer's singing audio of the target certified singer can be randomly extracted from the music library, and the representative works of the target certified singer can also be set as the singer's singing audio for voiceprint similarity comparison.
  • Step 2 receiving the user singing audio uploaded by the target user
  • Step 3 determine the user voiceprint similarity between the singer's singing audio and the user's singing audio, and the reference voiceprint similarity between the singer's singing audio and each of a plurality of reference singing audios;
  • songs sung by other singers can be selected from the music library as the reference singing audio, songs uploaded by other users can also be selected as the reference singing audio, or songs sung by other singers and songs uploaded by other users can both be used as the reference singing audio.
  • Step 4 Construct a similarity distribution model according to the reference voiceprint similarity between the singer's singing audio and each of the plurality of reference singing audios, and determine the distribution position of the user voiceprint similarity in the similarity distribution model;
  • Step 5 Determine whether the voiceprint of the user's singing audio matches the singer's singing audio according to the distribution location; if the voiceprint matches, determine that the target user has passed the singer's authentication.
  • a similarity distribution model is constructed according to the reference voiceprint similarity between the singer's singing audio and each of the plurality of reference singing audios, and whether the voiceprint of the user's singing audio matches the singer's singing audio is judged according to the distribution position of the user voiceprint similarity in the similarity distribution model.
  • this embodiment uses the distribution position of the similarity of the user's voiceprint in all reference voiceprint similarities to reflect the matching probability of the user's singing audio and the singer's singing audio, It realizes the dynamic standard to judge whether the voiceprint matches, and improves the accuracy of voiceprint recognition.
  • Fig. 7 is a flowchart of a singer authentication method provided by the embodiment of the present application.
  • This embodiment is a solution for applying the above-mentioned voiceprint recognition operation to the singer authentication scenario.
  • This embodiment can be combined with the above-mentioned embodiments to obtain a further implementation, and may include the following steps:
  • S701 Receive an authentication request from a user, and determine a target authentication singer corresponding to the authentication request.
  • S702 Determine the target audio according to the music works of the target certified singer in the database.
  • the target audio can be determined according to any musical composition of the target authenticated singer, and the above-mentioned selected musical composition can be a complete musical composition or a fragment of a musical composition.
  • the music track corresponding to the target audio can be determined, the musical composition of that track sung by the target certified singer can be queried from the database, and the target audio can be determined according to the musical composition.
  • the user audio uploaded by the user is a dry voice.
  • this embodiment can perform sound source separation on the music works of the target certified singer, and use the human voice obtained by sound source separation as the target audio, in order to achieve dry-voice-based voiceprint feature comparison.
  • S703 Calculate the user voiceprint similarity between the user audio and the target audio.
  • S704 Determine a plurality of reference audios according to the music works of N singers in the database, and calculate the similarity of reference voiceprints between the target audio and each reference audio.
  • N singers other than the target authentication singer can be randomly selected from the database, and the music works of the N singers can be used to determine the reference audio.
  • the musical works of the above-mentioned N singers may be complete musical works or fragments of musical works.
  • S705 Construct a similarity distribution model according to the reference voiceprint similarity between the target audio and each of the multiple reference audios, determine the distribution position of the user voiceprint similarity in the similarity distribution model, and judge whether the voiceprint of the user audio matches the target audio according to the distribution position.
  • In this way, the distribution position of the user relative to the target certified singer among the crowd can be determined; the higher the distribution position, the greater the possibility that the user's identity is the target certified singer.
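  • Steps S701-S705 can be condensed into one sketch: fit a Gaussian to the reference crowd's similarities and authenticate when the requester's similarity falls inside an assumed top-5% position interval (the crowd values and the threshold are illustrative):

```python
import math

def authenticate(user_sim, reference_sims, top_ratio=0.05):
    """End-to-end sketch of S701-S705: fit a Gaussian to the reference
    voiceprint similarities, locate the user's similarity via the upper
    cumulative distribution, and accept when it falls in the assumed
    top-5% position interval."""
    n = len(reference_sims)
    mean = sum(reference_sims) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in reference_sims) / n)
    ucd = 0.5 * math.erfc((user_sim - mean) / (std * math.sqrt(2)))
    return ucd <= top_ratio

# A requester scoring far above the reference crowd is authenticated;
# one scoring near the crowd average is not.
crowd = [0.30, 0.35, 0.40, 0.45, 0.50]
print(authenticate(0.90, crowd))  # True
print(authenticate(0.42, crowd))  # False
```

Because the decision depends on the crowd's distribution rather than a fixed similarity threshold, the same 0.42 similarity could pass for a target singer whose voice is rare in the population.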
  • FIG. 8 is a schematic diagram of interaction on the product side of a singer authentication method based on the probability of a reference crowd provided by an embodiment of the present application.
  • This embodiment provides an accurate and efficient singer authentication scheme for the situation in which no actual singer has claimed a musical work in the music database.
  • This embodiment offers a fast review procedure for singer authentication. As shown in Figure 8, a user requesting authentication enters the authentication interface via a mobile terminal or computer and first enters the information of the singer to be authenticated; the system then returns a specified song for the user to sing. After recording the dry vocal, the user uploads it to the backend server, which automatically verifies whether the voiceprint features of the recorded dry vocal match the voiceprint features of the to-be-authenticated singer's works in the music library.
  • Fig. 9 is a flow chart of a singer authentication method based on reference-population probability provided by an embodiment of the present application. This embodiment describes how the backend server, after receiving the dry vocal uploaded by the user, determines whether the user is the singer to be authenticated.
  • This embodiment may include the following steps:
  • Step 1: Dry vocal classification and preprocessing.
  • after a user requesting authentication uploads a dry vocal, the system checks whether it meets the requirements, which may include: a clear dry vocal with no obvious noise and no other irrelevant signals (speech, etc.). If the recording contains substantial noise or silence, or its effective duration is too short (for example, shorter than a 7 s threshold), the corresponding reason for authentication failure can be returned and the user reminded to re-record a recording that meets the requirements.
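The duration check in Step 1 can be sketched as follows. The amplitude-based silence detection and the 0.01 silence threshold are illustrative assumptions not fixed by the text; a production system would also need the noise and speech classification mentioned above, which is not shown.

```python
def check_dry_vocal(samples, sample_rate, min_effective_seconds=7.0,
                    silence_threshold=0.01):
    """Minimal sketch of the dry-vocal preprocessing check.

    `samples` is a list of floats in [-1, 1]. Effective duration is
    approximated as the time spent above the silence threshold.
    Returns (passed, reason).
    """
    # Count samples whose absolute amplitude exceeds the silence threshold.
    voiced = sum(1 for s in samples if abs(s) > silence_threshold)
    effective_seconds = voiced / sample_rate
    if effective_seconds < min_effective_seconds:
        return False, "effective duration too short, please re-record"
    return True, "ok"
```

A one-second clip at 16 kHz fails the 7 s requirement, while an eight-second clip passes.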
  • Step 2: Calculate voiceprint features.
  • in this step, the voiceprint features of the uploaded dry vocal and of the to-be-authenticated singer's corresponding works need to be calculated separately.
  • when the uploaded dry vocal meets the requirements, a neural network model is used to calculate its voiceprint feature X_vocal.
  • based on the to-be-authenticated singer ID uploaded by the user, the corresponding singer's song works are retrieved from the music library, and their voiceprint feature X_singer is calculated with the neural network model.
  • when computing voiceprint features from the singer's songs, a sound source separation method can first be used to remove the accompaniment and extract the vocals before calculating the features, or the features can be calculated directly without source separation.
  • Step 3: Return the authentication result according to the similarity between X_vocal and X_singer and the probability distribution information.
  • in this step, the cosine distance, L2 distance, or another distance suited to the voiceprint features can be used to calculate the voiceprint similarity between X_vocal and X_singer.
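The two distance options mentioned above can be sketched as follows, assuming the feature vectors X_vocal and X_singer are plain lists of floats of equal length (the text does not fix the feature dimensionality or representation).

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two voiceprint feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def l2_distance(x, y):
    """Euclidean (L2) distance, the alternative metric mentioned."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Identical vectors give a cosine similarity of 1, orthogonal vectors give 0, and the L2 distance of a 3-4-5 triangle's legs is 5.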
  • in the traditional scheme, if the voiceprint similarity is greater than a certain threshold, the authentication is considered successful; otherwise it is considered failed. In practice, however, the appropriate threshold differs from singer to singer, so a single universal threshold is hard to apply to all singers.
  • therefore, this embodiment proposes a voiceprint recognition scheme based on a reference population; please refer to FIG. 10, which is a schematic diagram of the principle of a reference-population-based voiceprint recognition algorithm provided by an embodiment of the present application.
  • as shown in FIG. 10, the cosine similarity corr_A between the voiceprint feature A of the user's uploaded dry vocal and the voiceprint feature B of the singer to be authenticated can be calculated. A sufficient number of singers (for example, 1000) C, D, E, ... are randomly selected from the population, and the cosine similarities corr_C, corr_D, corr_E, corr_F, ... between each of their voiceprint features and the to-be-authenticated singer's feature are calculated. From this reference-population similarity set, the mean corr_MEAN and variance corr_VAR are computed, and a Gaussian distribution function N(x, corr_MEAN, corr_VAR) is constructed from the two. The upper cumulative distribution function (UCD) value of corr_A in this reference-population Gaussian model is then computed; it expresses how far toward the head of the broad population the current dry vocal's similarity lies. For example, a UCD value of 0.1 means the current similarity ranks in the top 10% of the population, i.e., 90% of the population is less similar to the target singer than the currently requesting user's dry vocal sample.
  • a reasonable threshold (such as 0.15) can then be set on the upper cumulative distribution function value to judge whether the current dry vocal voiceprint feature A matches the to-be-authenticated singer's voiceprint feature B; if they match, the singer authentication is judged successful.
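The Gaussian-plus-UCD decision described above can be sketched as follows. `upper_cdf` computes the normal survival function P(X > x) in closed form via `math.erfc`; the 0.15 threshold is the illustrative value from the text, and the population variance (divide by n) is an assumption, since the text does not specify the estimator.

```python
import math

def upper_cdf(x, mean, std):
    """P(X > x) for X ~ N(mean, std**2): the upper cumulative
    distribution function (UCD) value used in the decision."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2.0)))

def authenticate(corr_a, reference_corrs, ucd_threshold=0.15):
    """Fit N(corr_MEAN, corr_VAR) to the reference-population
    similarities corr_C, corr_D, ... and accept corr_a when it sits
    far enough in the head of that distribution."""
    n = len(reference_corrs)
    corr_mean = sum(reference_corrs) / n
    corr_var = sum((c - corr_mean) ** 2 for c in reference_corrs) / n
    ucd = upper_cdf(corr_a, corr_mean, math.sqrt(corr_var))
    # A small UCD means few reference singers are as similar to the
    # target singer as the requesting user is.
    return ucd <= ucd_threshold
```

With reference similarities clustered around 0.3, a user similarity of 0.6 lies deep in the head of the distribution and is accepted, while 0.3 sits near the middle (UCD about 0.5) and is rejected.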
  • with the singing-timbre-based singer authentication scheme proposed in the above embodiment, a user requesting authentication only needs to upload a recorded a cappella performance for automatic identification.
  • machine learning / pattern recognition techniques can be used to automate the authentication review, and the proposed reference-population probability-distribution scheme replaces the traditional method of deciding recognition with an absolute threshold.
  • This embodiment can also replace the cumbersome traditional steps that require manual review, greatly saving manpower, and can return authentication results quickly, thereby increasing the platform's attractiveness for singer authentication, expanding the number of authenticated singers in the music library, and enhancing the platform's influence.
  • An embodiment of the present application also provides a voiceprint recognition device, which may include:
  • an audio determination module configured to receive user audio and determine the target audio corresponding to the user audio;
  • a similarity calculation module configured to determine the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of the multiple reference audios;
  • a distribution position determination module configured to construct a similarity distribution model according to the reference voiceprint similarity between the target audio and each of the multiple reference audios, and to determine the distribution position of the user voiceprint similarity in the similarity distribution model; and
  • a matching decision module configured to judge, according to the distribution position, whether the voiceprint of the user audio matches the target audio.
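The four modules above can be sketched as one class. The injected `similarity_fn` stands in for the voiceprint feature extraction and comparison (not fixed by the text), and the empirical rank cutoff of 0.85 is an illustrative assumption standing in for the distribution-model decision.

```python
import bisect

class VoiceprintDevice:
    """Minimal sketch of the voiceprint recognition device's modules."""

    def __init__(self, similarity_fn):
        # similarity calculation module: how two audios are compared
        self.similarity_fn = similarity_fn

    def determine_similarities(self, user_audio, target_audio, reference_audios):
        user_sim = self.similarity_fn(user_audio, target_audio)
        ref_sims = [self.similarity_fn(target_audio, r) for r in reference_audios]
        return user_sim, ref_sims

    def distribution_position(self, user_sim, ref_sims):
        # distribution position determination module (empirical rank)
        ranked = sorted(ref_sims)
        return bisect.bisect_left(ranked, user_sim) / len(ranked)

    def matches(self, user_audio, target_audio, reference_audios, cutoff=0.85):
        # matching decision module: match if the user's similarity ranks
        # above the cutoff fraction of the reference population
        user_sim, ref_sims = self.determine_similarities(
            user_audio, target_audio, reference_audios)
        return self.distribution_position(user_sim, ref_sims) >= cutoff
```

For a quick check, scalar "audios" with `similarity_fn = lambda a, b: 1.0 - abs(a - b)` exercise the whole pipeline without any real feature extraction.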
  • after receiving the user audio, this embodiment determines the user voiceprint similarity between the user audio and the target audio, and also determines the reference voiceprint similarity between each reference audio and the target audio. Since vocal ranges and timbres differ greatly across a population, different target audios give rise to different voiceprint similarity distributions within the crowd.
  • the probability distribution information of the user audio's voiceprint similarity among the reference audios is determined from the user voiceprint similarity and the reference voiceprint similarities, and whether the voiceprints of the user audio and the target audio match is judged from this probability distribution information.
  • compared with traditional schemes that evaluate voiceprint similarity entirely against a fixed threshold, this embodiment uses the probability distribution information of the user audio's voiceprint similarity to reflect the matching probability of the user audio and the target audio, judging voiceprint matches with a dynamic criterion and improving the accuracy of voiceprint recognition.
  • the present application also provides a storage medium on which a computer program is stored; when the computer program is executed, the steps provided in the above embodiments can be realized.
  • the storage medium may include: a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
  • the present application also provides an electronic device.
  • as shown in the structural diagram of an electronic device provided by an embodiment of the present application, the electronic device may include a processor 1110 and a memory 1120.
  • the processor 1110 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1110 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 1110 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state.
  • in some embodiments, the processor 1110 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen.
  • in some embodiments, the processor 1110 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
  • Memory 1120 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 1120 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the memory 1120 is at least used to store the following computer program 1121; after being loaded and executed by the processor 1110, the computer program can implement the relevant steps of the voiceprint recognition method and/or singer authentication method disclosed in any of the foregoing embodiments.
  • the resources stored in the memory 1120 may also include an operating system 1122 and data 1123, etc., and the storage method may be temporary storage or permanent storage.
  • the operating system 1122 may include Windows, Linux, Android and so on.
  • the electronic device may further include a display screen 1130 , an input/output interface 1140 , a communication interface 1150 , a sensor 1160 , a power supply 1170 and a communication bus 1180 .
  • the structure of the electronic device shown in FIG. 11 does not constitute a limitation on the electronic device of the embodiments of the present application.
  • in practical applications, the electronic device may include more or fewer components than those shown in FIG. 11, or combine certain components.

Abstract

A voiceprint recognition method, a singer authentication method, an electronic device, and a storage medium. The voiceprint recognition method comprises: receiving user audio and determining the target audio corresponding to the user audio; determining the user voiceprint similarity between the target audio and the user audio, and the reference voiceprint similarity between the target audio and each of multiple reference audios; constructing a similarity distribution model according to the reference voiceprint similarity between the target audio and each of the multiple reference audios, and determining the distribution position of the user voiceprint similarity in the similarity distribution model; and judging, according to the distribution position, whether the voiceprint of the user audio matches the target audio. The present application can judge voiceprint matches with a dynamic criterion, improving the accuracy of voiceprint recognition.

Description

一种声纹识别方法、歌手认证方法、电子设备及存储介质 技术领域
本申请涉及生物识别技术领域,特别涉及一种声纹识别方法、一种歌手认证方法、一种电子设备及一种存储介质。
背景技术
声纹是用电声学仪器显示的携带言语信息的声波频谱,通过声纹识别能够判断多个音频的输入者是否为同一人。如今,声纹识别已经广泛应用于设备解锁、银行交易、歌手认证等多种场景。
在声纹识别过程中,往往确定实际采集的用户音频以及需要与用户音频进行声纹比对的目标音频,将用户音频的声纹特征与目标音频的声纹特征进行声纹比对得到声纹特征相似度,并利用固定阈值判断声纹是否匹配。但是由于人群中存在声纹相似度分布不均衡的情况,故难以用固定的阈值评价声纹是否匹配。
因此,如何提高声纹识别的准确率是本领域技术人员目前需要解决的技术问题。
发明内容
本申请的目的是提供一种声纹识别方法、一种歌手认证方法、一种电子设备及一种存储介质,能够提高声纹识别的准确率。
为解决上述技术问题,本申请提供一种声纹识别方法,该声纹识别方法包括:
接收用户音频,并确定所述用户音频对应的目标音频;
确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度;
根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布 模型中的分布位置;
根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配。
可选的,根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置,包括:
根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度的均值和方差构建高斯分布函数;
计算所述用户声纹相似度在所述高斯分布函数中的上累计分布函数值,并根据所述上累计分布函数值确定所述用户声纹相似度在所述高斯分布函数中的分布位置。
可选的,确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别和多个参考音频中每一参考音频的参考声纹相似度,包括:
确定所述用户音频的用户声纹特征向量、所述目标音频的目标声纹特征向量和多个参考音频中每一所述参考音频的参考声纹特征向量;
根据所述用户声纹特征向量和所述目标声纹特征向量计算所述用户声纹相似度;
根据所述目标声纹特征向量和所述参考声纹特征向量计算所述参考声纹相似度。
可选的,根据所述用户声纹特征向量和所述目标声纹特征向量计算所述用户声纹相似度,包括:
根据所述用户声纹特征向量和所述目标声纹特征向量的余弦距离计算所述用户声纹相似度;
相应的,根据所述目标声纹特征向量和所述参考声纹特征向量计算所述参考声纹相似度,包括:
根据所述用户声纹特征向量和所述参考声纹特征向量的余弦距离计算所述参考声纹相似度。
可选的,在接收用户音频之后,还包括:
判断所述用户音频是否符合预设条件;其中,所述预设条件包括清晰 度约束条件、时长约束条件和音频类型约束条件中的任一项或任几项的组合;
若是,则执行确定所述用户音频对应的目标音频的操作;
若否,则返回音频录入失败的提示信息,并重新接收用户音频。
可选的,确定所述用户音频对应的目标音频,包括:
接收用户的认证请求,并确定所述认证请求对应的目标认证歌手;
根据数据库中所述目标认证歌手的音乐作品确定所述目标音频。
可选的,根据数据库中所述目标认证歌手的音乐作品确定所述目标音频,包括:
确定所述目标音频对应的音乐曲目;
从所述数据库中查询所述目标认证歌手演唱所述音乐曲目的音乐作品,并根据所述音乐作品确定所述目标音频。
可选的,根据数据库中所述目标认证歌手的音乐作品确定所述目标音频包括:
对所述数据库中所述目标认证歌手的音乐作品进行声源分离,并将声源分离得到的人声作为所述目标音频。
可选的,确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别和多个参考音频中每一参考音频的参考声纹相似度,包括:
计算所述用户音频和所述目标音频的用户声纹相似度;
根据数据库中N名歌手的音乐作品确定多个所述参考音频,并计算所述目标音频和每一所述参考音频的参考声纹相似度。
可选的,根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配,包括:
判断所述分布位置是否在预设位置区间内;
若是,则判定所述用户音频与所述目标音频的声纹匹配;
若否,则判定所述用户音频与所述目标音频的声纹不匹配。
可选的,根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配,包括:
对所述分布位置和所述用户声纹相似度进行加权计算,得到相似度综合得分;
判断所述相似度综合得分是否大于预设得分;
若是,则判定所述用户音频与所述目标音频的声纹匹配;
若否,则判定所述用户音频与所述目标音频的声纹不匹配。
本申请还提供了一种歌手认证方法,包括:
接收目标用户的歌手认证请求,确定所述歌手认证请求对应的目标认证歌手,并查询所述目标认证歌手的歌手演唱音频;
接收所述目标用户上传的用户演唱音频;
确定所述歌手演唱音频和所述用户演唱音频的用户声纹相似度,以及所述歌手演唱音频分别和多个参考演唱音频中每一参考演唱音频的参考声纹相似度;
根据所述歌手演唱音频分别与多个参考演唱音频中每一参考演唱音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置;
根据所述分布位置判断所述用户演唱音频与所述歌手演唱音频是否声纹匹配;若声纹匹配,则判定所述目标用户通过歌手认证。
本申请还提供了一种存储介质,其上存储有计算机程序,所述计算机程序执行时实现上述声纹识别方法执行的步骤。
本申请还提供了一种电子设备,包括存储器和处理器,所述存储器中存储有计算机程序,所述处理器调用所述存储器中的计算机程序时实现上述声纹识别方法执行的步骤。
本申请提供了一种声纹识别方法,包括:接收用户音频,并确定所述用户音频对应的目标音频;确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度;根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置;根据所述分布位置判断所述用户音频与所述目 标音频是否声纹匹配。
本申请在接收到用户音频后,确定用户音频与目标音频的用户声纹相似度,还确定参考音频和目标音频的参考声纹相似度。由于人群中存在各个人的音域和音色差别较大的情况,故对于不同的目标音频,人群中存在不同的声纹相似度分布。本申请根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并根据用户声纹相似度在所述相似度分布模型中的分布位置判断所述用户音频与所述目标音频是否声纹匹配。相对于传统方案中完全依据固定阈值评价声纹相似度的方案,本申请利用用户声纹相似度的在所有参考声纹相似度中的分布位置反映用户音频与目标音频的匹配概率,实现了以动态标准判断声纹是否匹配,提高了声纹识别的准确率。本申请同时还提供了一种歌手认证方法、一种电子设备和一种存储介质,具有上述有益效果,在此不再赘述。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例所提供的一种声纹识别系统的架构图;
图2为本申请实施例所提供的一种声纹识别方法的流程图;
图3为本申请实施例所提供的一种用户声纹相似度的分布位置确定方法的流程图;
图4为本申请实施例所提供的一种参考音频高斯分布模型;
图5为本申请实施例所提供的一种匹配相似度信息确定方法的流程图;
图6为本申请实施例所提供的一种音频预处理方法的流程图;
图7为本申请实施例所提供的一种歌手认证方法的流程图;
图8为本申请实施例所提供的一种基于参考人群概率的歌手认证方法 的产品侧交互示意图;
图9为本申请实施例所提供的一种基于参考人群概率的歌手认证方法的流程图;
图10为本申请实施例所提供的一种基于参考人群的声纹识别算法的原理示意图;
图11为本申请实施例所提供的一种电子设备的结构图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
在声纹识别过程中,往往确定实际采集的用户音频以及需要与用户音频进行声纹比对的目标音频,将用户音频的声纹特征与目标音频的声纹特征进行声纹比对得到声纹特征相似度,并利用固定阈值判断声纹是否匹配。但是由于人群中存在声纹相似度分布不均衡的情况,例如歌手A与人群中多数人的音域和音色相同,则需要声纹特征相似度达到90%才可以判定声纹匹配;再例如歌手B与人群中绝大多数人的音域和音色均不相同,则在声纹特征相似度达到70%即可以判定声纹匹配。由此可见对于不同的目标音频,存在不同的衡量声纹是否匹配的标准,上述基于固定阈值判断声纹是否匹配的方案的声纹识别准确率较低。为了解决上述声纹识别过程中存在的缺陷,本申请提供了以下几种实施方式,能够达到提高声纹识别准确率的效果。
为了便于理解本申请提供的声纹识别方法,下面对其使用的系统进行介绍。请参见图1,图1为本申请实施例提供的一种声纹识别系统的架构图,该声纹识别系统包括客户端101、计算设备102和数据库103,用户可以通过 客户端101向计算设备传输用户音频,计算设备102在接收到用户音频后向数据库发送音频获取请求,以便获取用户需要进行比对的目标音频以及用于评价用户音频与目标声纹相似度的参考音频。计算设备可以计算用户音频与目标音频在人群中的排位概率,并基于排位概率判断用户音频与目标音频的声纹是否匹配。
下面请参见图2,图2为本申请实施例所提供的一种声纹识别方法的流程图,具体步骤可以包括:
S201:接收用户音频,并确定用户音频对应的目标音频;
其中,本实施例可以应用于智能手机、个人计算机或服务器等电子设备,该电子设备可以设置有语音输入模块并利用该语音输入模块接收用户实时输入的用户音频,该电子设备也可以与其他设备通过有线或无线的方式连接并接收其他设备传输的用户音频。
用户音频为需要进行声纹识别的用户的音频,目标音频为需要与用户音频进行声纹特征比对的音频。目标音频可以根据实施例的应用场景设置,例如在银行交易的过程中用户音频为交易者的声音内容,目标音频为银行账户创建时创建者的声音内容;例如在歌手申请认证的过程中,用户音频为认证请求者的声音内容,目标音频为被请求认证的歌手的歌曲内容。
作为一种可行的实施方式,在确定目标音频对应的目标音频之前,本实施例还可以存在获取用户认证请求的操作,通过解析用户认证请求确定用户音频对应的目标音频。作为另一种可行的实施方式,本实施例还可以根据用户音频的内容确定用户音频的目标音频,例如在歌手申请认证的过程中,用户音频为用户现场演唱的一个歌曲,本实施例可以根据用户音频确定演唱曲目,并根据演唱曲目确定用户音频对应的目标音频。
S202:确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度;
其中,在本步骤之前还可以存在从数据库中随机获取参考音频的操作,参考音频可以为与目标音频不同的任意音频。为了提高声纹识别的准确率,本实施例可以限定参考音频的数量不少于预设数量,以便分别计算每一参 考音频与目标音频的参考声纹相似度。
S203:根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置;
S204:根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配。
其中,参考音频与目标音频的参考声纹相似度反映了人群中其他人与目标音频的输入者的声纹相似概率。用户声纹相似度在所述相似度分布模型中的分布位置反映了用户声纹相似度在人群中的排位概率。具体的可以通过参考人群的参考声纹相似度值去建立相似度分布模型(如高斯模型),根据用户声纹相似度在上述相似度分布模型中的分布位置确定排位概率,排位概率越高用户音频与目标音频声纹匹配的概率越高。用户声纹相似度在所述相似度分布模型中的分布位置可以通过upper cumulative distribution(上累积分布,UCD)体现。根据相似度分布模型中的分布位置可以确定用户音频与目标音频的声纹相似度排名,排名越靠前,用户音频与目标音频声纹匹配的概率越大。
在得到了用户声纹相似度在相似度分布模型中分布位置的基础上,本实施例可以判断所述分布位置是否在预设位置区间内;若在预设位置区间内,则可以判定所述用户音频与所述目标音频的声纹匹配,即:目标音频和用户音频均为同一用户输入的音频;若不在预设位置区间内,则可以判定所述用户音频与所述目标音频的声纹不匹配,即目标音频和用户音频均为不同用户输入的音频。
本实施例在接收到用户音频后,确定用户音频与目标音频的用户声纹相似度,还确定参考音频和目标音频的参考声纹相似度。由于人群中存在各个人的音域和音色差别较大的情况,故对于不同的目标音频,人群中存在不同的声纹相似度分布。本实施例根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并根据用户声纹相似度在所述相似度分布模型中的分布位置判断所述用户音频与所述目标音频是否声纹匹配。相对于传统方案中完全依据固定阈值评价声纹相似 度的方案,本实施例利用用户声纹相似度的在所有参考声纹相似度中的分布位置反映用户音频与目标音频的匹配概率,实现了以动态标准判断声纹是否匹配,提高了声纹识别的准确率。
请参见图3,图3为本申请实施例所提供的一种用户声纹相似度的分布位置确定方法的流程图,本实施例是对图2对应的实施例中S203的进一步介绍,可以将本实施例与图2对应的实施例相结合得到进一步的实施方式,本实施例可以包括以下步骤:
S301:根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度的均值和方差构建高斯分布函数;
其中,本实施例可以根据所有参考声纹相似度构建相似度集合,确定相似度集合的均值和方差,并基于均值和方差构建高斯分布函数。
S302:计算所述用户声纹相似度在所述高斯分布函数中的上累计分布函数值;
S303:根据所述上累计分布函数值确定所述用户声纹相似度在所述高斯分布函数中的分布位置;
在得到高斯分布函数的基础上,本实施例计算用户声纹相似度在所述高斯分布函数中的上累计分布函数值,上累计分布函数值用于描述用户音频与目标音频的声纹相似度在所有参考音频中的头部位置占比,可以根据上累计分布函数值确定参考音频中与目标音频的声纹相似度不如用户音频的比例。
请参见图4,图4为本申请实施例所提供的一种参考音频高斯分布模型,图4中的P点为用户音频与目标音频的声纹相似度在高斯分布函数的位置,虚线区域为用户声纹相似度在高斯分布函数中的上累计分布函数值,图4中Y轴为随机变量x等于某数发生的概率,X轴代表随机变量。通过上述方式能够高效、准确地计算用户声纹相似度的分布位置,进而提高了声纹识别的效率和准确率。
请参见图5,图5为本申请实施例所提供的一种匹配相似度信息确定方法的流程图,本实施例是对图2对应的实施例中S202的进一步介绍,可以将本实施例与图2对应的实施例相结合得到进一步的实施方式,本实施例可以包括以下步骤:
S501:确定所述用户音频的用户声纹特征向量、所述目标音频的目标声纹特征向量和多个参考音频中每一所述参考音频的参考声纹特征向量;
S502:根据用户声纹特征向量和目标声纹特征向量计算用户声纹相似度;
S503:根据用户声纹特征向量和参考声纹特征向量计算参考声纹相似度。
上述实施例中,可以通过多种方式确定音频的声纹特征向量,例如可以基于神经网络embedding的方法计算用户声纹特征向量、目标声纹特征向量和参考声纹特征向量,也可以基于统计信号处理ivector的方法计算用户声纹特征向量、目标声纹特征向量和参考声纹特征向量。
进一步的,本实施例可以根据用户声纹特征向量、目标声纹特征向量和参考声纹特征向量之间的距离确定声纹相似度。作为一种可行的实施方式,本实施例可以根据用户声纹特征向量和目标声纹特征向量的余弦距离计算用户声纹相似度,可以根据用户声纹特征向量和参考声纹特征向量的余弦距离计算参考声纹相似度。通过上述方式能够提高用户声纹相似度和参考声纹相似度的准确度,进而实现高精度的声纹识别。
请参见图6,图6为本申请实施例所提供的一种音频预处理方法的流程图,本实施例是对图2对应实施例中接收用户音频之后的进一步补充介绍,可以将本实施例与图2对应的实施例相结合得到进一步的实施方式,本实施例可以包括以下步骤:
S601:判断用户音频是否符合预设条件;若是,则进入步骤S602;若否,则进入步骤S603;
其中,所述预设条件包括清晰度约束条件、时长约束条件和音频类型约束条件中的任一项或任几项的组合。具体的,若用户音频中无明显噪声 以及其他无关信号,则可以判定用户音频符合清晰度约束条件;若用户音频的时长在预设时长区间内,则可以判定用户音频符合时长约束条件;若用户音频为干声,则可以判定用户音频符合音乐类型约束条件。
S602:执行确定所述用户音频对应的目标音频的操作;
S603:返回音频录入失败的提示信息,并重新接收用户音频。
在本实施例中,若目标音频符合预设条件,则可以继续执行确定所述用户音频对应的目标音频的操作以便执行S201~S204的相关操作;若目标音频不符合预设条件,则可以不执行确定目标音频的操作,并返回音频录入失败的提示信息以便提示用户重新录入音频。通过上述音频预处理操作能够滤除无效音频,降低声纹识别设备的功耗。
进一步的,作为对于以上实施例的进一步介绍,还可以通过以下方式判断所述用户音频与所述目标音频是否声纹匹配:对所述分布位置和所述用户声纹相似度进行加权计算,得到相似度综合得分;判断所述相似度综合得分是否大于预设得分;若是,则判定所述用户音频与所述目标音频的声纹匹配;若否,则判定所述用户音频与所述目标音频的声纹不匹配。
具体的,本实施例可以为分布位置和用户声纹相似度设置对应的权重值,通过加权计算相似度综合得分判断声纹是否匹配,进一步提高了声纹识别的准确性。具体的,每一个分布位置都有其对应的排名得分,分布位置越靠前排名得分越高,可以将排名得分和用户声纹相似度分别与对应的权重值相乘,并将二者之和作为相似度综合得分。
举例说明上述方案:
设置分布位置的权重为0.6,用户声纹相似度的权重为0.4,当相似度综合得分大于0.8时判定声纹匹配。
若用户音频与目标音频的用户声纹相似度为0.6,用户声纹相似度的分布位置为前1%,排名得分为0.99,相似度综合得分为0.99*0.6+0.6*0.4=0.834。虽然该用户音频与目标音频的声纹相似度较低,但是由于目标音频的音域和声纹特征在人群中较为少见,当用户音频在人群中的分布位置较高时仍可以判定声纹匹配。
若用户音频与目标音频的用户声纹相似度为0.9,用户声纹相似度的分布位置为前50%,排名得分为0.5,相似度综合得分为0.5*0.6+0.9*0.4=0.66。虽然该用户音频与目标音频的声纹相似度较高,但是由于目标音频的音域和声纹特征在人群中较为常见,导致用户音频在人群中的分布位置较低,此时可以判定声纹不匹配。
由此可见,通过上述方式能够避免传统方案中仅对声纹相似度采用固定阈值进行判断导致识别精度较低的弊端,本实施例基于声纹相似度和分布位置对声纹是否匹配进行综合决策,提高了声纹识别的准确率。
在实际应用中,用于存储歌曲的数据库里面存在大量没有歌手入驻对应的作品,时常有歌手申请认领对应作品的归属身份的情况。相关技术中仅依靠声纹相似度实现歌手认证,但是由于人群中存在声纹相似度分布不均衡的情况,故难以用固定的阈值进行歌手认证。针对上述问题,本申请提供一种用户认证歌手的方法,该方法包括以下步骤:
步骤1:接收目标用户的歌手认证请求,确定所述歌手认证请求对应的目标认证歌手,并查询所述目标认证歌手的歌手演唱音频;
其中,本实施例可以应用于音乐服务器,在接收到终端设备上传的歌手认证请求后,可以确定目标歌手想要认证歌手,即目标认证歌手。本实施例可以从曲库中随机抽取目标认证歌手的歌手演唱音频,也可以将目标认证歌手的代表作设置为用于进行声纹相似度比对的歌手演唱音频。
步骤2:接收所述目标用户上传的用户演唱音频;
步骤3:确定所述歌手演唱音频和所述用户演唱音频的用户声纹相似度,以及所述歌手演唱音频分别和多个参考演唱音频中每一参考演唱音频的参考声纹相似度;
其中,本实施例可以从曲库中选取其他歌手演唱的歌曲作为参考演唱音频,也可以选取其他用户上传的歌曲作为参考演唱音频,还可以将其他歌手演唱的歌曲和其他用户上传的歌曲共同作为参考演唱音频。
步骤4:根据所述歌手演唱音频分别与多个参考演唱音频中每一参考演唱音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相 似度在所述相似度分布模型中的分布位置;
步骤5:根据所述分布位置判断所述用户演唱音频与所述歌手演唱音频是否声纹匹配;若声纹匹配,则判定所述目标用户通过歌手认证。
本实施例在接收到歌手认证请求后,确定用户演唱音频与歌手演唱音频的用户声纹相似度,还确定参考演唱音频和歌手演唱音频的参考声纹相似度。由于人群中存在各个人的音域和音色差别较大的情况,故对于不同的歌手演唱音频,人群中存在不同的声纹相似度分布。本实施例根据所述歌手演唱音频分别与多个参考演唱音频中每一参考演唱音频的参考声纹相似度构建相似度分布模型,并根据用户声纹相似度在所述相似度分布模型中的分布位置判断所述用户演唱音频与所述歌手演唱音频是否声纹匹配。相对于传统方案中完全依据固定阈值评价声纹相似度的方案,本实施例利用用户声纹相似度的在所有参考声纹相似度中的分布位置反映用户演唱音频与歌手演唱音频的匹配概率,实现了以动态标准判断声纹是否匹配,提高了声纹识别的准确率。
请参见图7,图7为本申请实施例所提供的一种歌手认证方法的流程图,本实施例是将上述声纹识别操作应用于歌手认证场景的方案,可以将本实施例与上述实施例相结合得到进一步的实施方式,本实施例可以包括以下步骤:
S701:接收用户的认证请求,并确定所述认证请求对应的目标认证歌手。
S702:根据数据库中所述目标认证歌手的音乐作品确定所述目标音频。
其中,本实施例可以根据目标认证歌手的任意音乐作品确定目标音频,上述选取的音乐作品可以为完整的音乐作品,也可以为音乐作品的片段。作为一种可行的实施方式,本实施例可以确定所述目标音频对应的音乐曲目,从所述数据库中查询所述目标认证歌手演唱所述音乐曲目的音乐作品,并根据所述音乐作品确定所述目标音频。
进一步的,用户上传的用户音频为干声,为了提高声纹识别的准确性,本实施例可以对目标认证歌手的音乐作品进行声源分离,并将声源分离得到的人声作为所述目标音频,以便实现基于干声的声纹特征对比。
S703:计算所述用户音频和所述目标音频的用户声纹相似度。
S704:根据数据库中N名歌手的音乐作品确定多个参考音频,并计算所述目标音频和每一参考音频的参考声纹相似度。
其中,本实施例可以从数据库中随机选取除了目标认证歌手之外N名的歌手,并将N名歌手的音乐作品确定所述参考音频。上述N名歌手的音乐作品可以为完整音乐作品,也可以为音乐作品片段。
S705:根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置,根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配。
通过上述方式能够确定用户在人群中与目标认证歌手的分布位置,分布位置越靠前用户的身份是该目标认证歌手的可能性越大。
下面通过在实际应用中的实施例说明上述实施例描述的流程。
请参见图8,图8为本申请实施例所提供的一种基于参考人群概率的歌手认证方法的产品侧交互示意图。本实施例针对曲库中无实际歌手认领音乐作品的情况,提供一种准确且高效的歌手认证方案。本实施例在面向歌手认证时提供一种快速审核的方式,如图8所示请求认证的用户通过手机终端或电脑进入认证界面,首先输入待认证歌手信息,然后系统返回指定曲目供用户唱,用户录制完干声后上传至后台服务器,后台服务器自动验证匹配录制干声的声纹特征和曲库待认证歌手作品的声纹特征。
请参见图9,图9为本申请实施例所提供的一种基于参考人群概率的歌手认证方法的流程图,本实施例描述了后台服务器接收到用户上传的干声后判断用户是否为待认证歌手的实现方式,本实施例可以包括以下步骤:
步骤1:干声分类预处理。
当请求认证的用户上传一段干声后,需要判断一下上传的干声是否符合要求。上述要求可以包括:干声清晰、无明显噪声、以及其他无关信号(说话声等等)。如果干声录制中有大量杂音或者静音以及有效时长过短(如时长小于阈值7s)时,可以返回对应的认证失败的原因,并提醒用户 重新录制符合要求的录音。
步骤2:计算声纹特征。
在本步骤中需要分别计算上传的干声的声纹特征和待认证歌手对应作品的声纹特征。具体的,当上传干声符合要求时,使用神经网络模型计算干声的声纹特征X_vocal。根据用户上传的待认证歌手id,在曲库中检索对应歌手的歌曲作品,并使用神经网络模型计算歌手歌曲作品的声纹特征X_singer。进一步的,基于歌手歌曲计算声纹特征时,可以使用声源分离方法先将伴奏分离提取人声后计算声纹特征,也可以不做声源分离直接计算声纹特征。
步骤3:根据X_vocal和X_singer的相似度和概率分布信息,返回认证结果。
其中,本步骤可以使用余弦距离、L2距离或其他对应声纹特征的距离计算X_vocal和X_singer的声纹相似度。在传统方案中,若声纹相似度大于某阈值时可以认为认证成功,否则认为认证失败。但是,在实际业务中发现不同歌手的适用阈值并不同,难以用一个通用的阈值作用于所有歌手。因此本实施例提出了一种基于参考人群的声纹识别方案,请参见图10,图10为本申请实施例所提供的一种基于参考人群的声纹识别算法的原理示意图。
如图10所示,本实施例可以计算用户上传干声声纹特征A与待认证歌手声纹特征B的余弦相似度corr_A。从人群中随机挑选足够数量的若干名歌手(例如1000名)C,D,E…,分别计算他们各自的声纹特征与待认证歌手声纹特征的余弦相似度corr_C,corr_D,corr_E,corr_F…。基于参考人群的相似度集合corr_C,corr_D,corr_E,corr_F…,计算均值corr_MEAN和方差corr_VAR,基于这两者构造高斯分布函数N(x,corr_MEAN,corr_VAR)。计算当前请求干声样本的余弦相似度corr_A在参考人群高斯模型中的上累计分布函数值(upper cumulative distribution,UCD),其数值意义为当下请求干声的相似度在广泛人群中的头部占比。例如若计算的上累计分布函数值为0.1,意味着当下的相似度在人群中能排名前10%,也即人群中90%的人与目标歌手的相似度都不如当下请求的用户干声样本。
本实施例可以为上累计分布函数值设置一个合理的阈值(如0.15)用于判断当前的干声声纹特征A与待认证歌手声纹特征B是否匹配,若匹配则判定歌手认证成功。
上述实施例提出的一种基于歌声音色识别的歌手认证的方案,请求认证的用户仅仅只需要上传一段录制的清唱录音即可进行自动识别。本实施例可以采用机器学习/模式识别的技术能自动化进行认证审核,并且提出参考人群概率分布的方案替代传统设置一个绝对阈值的方式来判定识别。本实施例也能替代传统需要人工审核的繁琐步骤,极大地节约了人力,并且能快速返回认证结果,进而增加平台对歌手认证的吸引力,扩大曲库中入驻认证歌手的数量,提高平台影响力。
本申请实施例还提供的一种声纹识别装置,该装置可以包括:
音频确定模块,用于接收用户音频,并确定所述用户音频对应的目标音频;
相似度计算模块,用于确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度;
分布位置确定模块,由于根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置;
匹配决策模块,用于根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配。
本实施例在接收到用户音频后,确定用户音频与目标音频的用户声纹相似度,还确定参考音频和目标音频的参考声纹相似度。由于人群中存在各个人的音域和音色差别较大的情况,故对于不同的目标音频,人群中存在不同的声纹相似度分布。本实施例根据用户声纹相似度和参考声纹相似度确定用户音频与目标音频的声纹相似度在参考音频中的概率分布信息,并根据所述概率分布信息判断所述用户音频与所述目标音频是否声纹匹配。相对于传统方案中完全依据固定阈值评价声纹相似度的方案,本实施 例利用用户音频的声纹相似度的概率分布信息反映用户音频与目标音频的匹配概率,实现了以动态标准判断声纹是否匹配,提高了声纹识别的准确率。
本申请还提供了一种存储介质,其上存有计算机程序,该计算机程序被执行时可以实现上述实施例所提供的步骤。该存储介质可以包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本申请还提供了一种电子设备,参见图11,本申请实施例提供的一种电子设备的结构图,如图11所示,可以包括处理器1110和存储器1120。
其中,处理器1110可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1110可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器1110也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器1110可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器1110还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器1120可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1120还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。本实施例中,存储器1120至少用于存储以下计算机程序1121,其中,该计算机程序被处理器1110加载并执行之后,能够实现前述任一实施例公开的声纹识别方法和/或歌手认证方法中的相关步骤。另外,存储器1120所存储的资源还可以包括操作系统1122和数据1123等,存储方式可以是短暂存储或者永久存储。其中,操作系统1122可以包括Windows、Linux、Android 等。
在一些实施例中,电子设备还可包括有显示屏1130、输入输出接口1140、通信接口1150、传感器1160、电源1170以及通信总线1180。
当然,图11所示的电子设备的结构并不构成对本申请实施例中电子设备的限定,在实际应用中电子设备可以包括比图11所示的更多或更少的部件,或者组合某些部件。
说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以对本申请进行若干改进和修饰,这些改进和修饰也落入本申请权利要求的保护范围内。
还需要说明的是,在本说明书中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的状况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。

Claims (14)

  1. 一种声纹识别方法,其特征在于,包括:
    接收用户音频,并确定所述用户音频对应的目标音频;
    确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度;
    根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置;
    根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配。
  2. 根据权利要求1所述声纹识别方法,其特征在于,根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置,包括:
    根据所述目标音频分别与多个参考音频中每一参考音频的参考声纹相似度的均值和方差构建高斯分布函数;
    计算所述用户声纹相似度在所述高斯分布函数中的上累计分布函数值,并根据所述上累计分布函数值确定所述用户声纹相似度在所述高斯分布函数中的分布位置。
  3. 根据权利要求1所述声纹识别方法,其特征在于,确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别和多个参考音频中每一参考音频的参考声纹相似度,包括:
    确定所述用户音频的用户声纹特征向量、所述目标音频的目标声纹特征向量和多个参考音频中每一所述参考音频的参考声纹特征向量;
    根据所述用户声纹特征向量和所述目标声纹特征向量计算所述用户声纹相似度;
    根据所述目标声纹特征向量和所述参考声纹特征向量计算所述参考声纹相似度。
  4. 根据权利要求3所述声纹识别方法,其特征在于,根据所述用户声纹特征向量和所述目标声纹特征向量计算所述用户声纹相似度,包括:
    根据所述用户声纹特征向量和所述目标声纹特征向量的余弦距离计算所述用户声纹相似度;
    相应的,根据所述目标声纹特征向量和所述参考声纹特征向量计算所述参考声纹相似度,包括:
    根据所述用户声纹特征向量和所述参考声纹特征向量的余弦距离计算所述参考声纹相似度。
  5. 根据权利要求1所述声纹识别方法,其特征在于,在接收用户音频之后,还包括:
    判断所述用户音频是否符合预设条件;其中,所述预设条件包括清晰度约束条件、时长约束条件和音频类型约束条件中的任一项或任几项的组合;
    若是,则执行确定所述用户音频对应的目标音频的操作;
    若否,则返回音频录入失败的提示信息,并重新接收用户音频。
  6. 根据权利要求1所述声纹识别方法,其特征在于,确定所述用户音频对应的目标音频,包括:
    接收用户的认证请求,并确定所述认证请求对应的目标认证歌手;
    根据数据库中所述目标认证歌手的音乐作品确定所述目标音频。
  7. 根据权利要求6所述声纹识别方法,其特征在于,根据数据库中所述目标认证歌手的音乐作品确定所述目标音频,包括:
    确定所述目标音频对应的音乐曲目;
    从所述数据库中查询所述目标认证歌手演唱所述音乐曲目的音乐作品,并根据所述音乐作品确定所述目标音频。
  8. 根据权利要求6所述声纹识别方法,其特征在于,根据数据库中所述目标认证歌手的音乐作品确定所述目标音频包括:
    对所述数据库中所述目标认证歌手的音乐作品进行声源分离,并将声源分离得到的人声作为所述目标音频。
  9. 根据权利要求6所述声纹识别方法,其特征在于,确定所述目标音频和所述用户音频的用户声纹相似度,以及所述目标音频分别和多个参考音频中每一参考音频的参考声纹相似度,包括:
    计算所述用户音频和所述目标音频的用户声纹相似度;
    根据数据库中N名歌手的音乐作品确定多个所述参考音频,并计算所述目标音频和每一所述参考音频的参考声纹相似度。
  10. 根据权利要求1至9任一项所述声纹识别方法,其特征在于,根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配,包括:
    判断所述分布位置是否在预设位置区间内;
    若是,则判定所述用户音频与所述目标音频的声纹匹配;
    若否,则判定所述用户音频与所述目标音频的声纹不匹配。
  11. 根据权利要求1至9任一项所述声纹识别方法,其特征在于,根据所述分布位置判断所述用户音频与所述目标音频是否声纹匹配,包括:
    对所述分布位置和所述用户声纹相似度进行加权计算,得到相似度综合得分;
    判断所述相似度综合得分是否大于预设得分;
    若是,则判定所述用户音频与所述目标音频的声纹匹配;
    若否,则判定所述用户音频与所述目标音频的声纹不匹配。
  12. 一种歌手认证方法,其特征在于,包括:
    接收目标用户的歌手认证请求,确定所述歌手认证请求对应的目标认证歌手,并查询所述目标认证歌手的歌手演唱音频;
    接收所述目标用户上传的用户演唱音频;
    确定所述歌手演唱音频和所述用户演唱音频的用户声纹相似度,以及所述歌手演唱音频分别和多个参考演唱音频中每一参考演唱音频的参考声纹相似度;
    根据所述歌手演唱音频分别与多个参考演唱音频中每一参考演唱音频的参考声纹相似度构建相似度分布模型,并确定所述用户声纹相似度在所述相似度分布模型中的分布位置;
    根据所述分布位置判断所述用户演唱音频与所述歌手演唱音频是否声纹匹配;若声纹匹配,则判定所述目标用户通过歌手认证。
  13. 一种电子设备,其特征在于,包括存储器和处理器,所述存储器中存储有计算机程序,所述处理器调用所述存储器中的计算机程序时实现 如权利要求1至12任一项所述方法的步骤。
  14. 一种存储介质,其特征在于,所述存储介质中存储有计算机可执行指令,所述计算机可执行指令被处理器加载并执行时,实现如权利要求1至12任一项所述方法的步骤。
PCT/CN2021/092291 2021-05-08 2021-05-08 一种声纹识别方法、歌手认证方法、电子设备及存储介质 WO2022236453A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180001166.3A CN113366567A (zh) 2021-05-08 2021-05-08 一种声纹识别方法、歌手认证方法、电子设备及存储介质
PCT/CN2021/092291 WO2022236453A1 (zh) 2021-05-08 2021-05-08 一种声纹识别方法、歌手认证方法、电子设备及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/092291 WO2022236453A1 (zh) 2021-05-08 2021-05-08 一种声纹识别方法、歌手认证方法、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022236453A1 true WO2022236453A1 (zh) 2022-11-17

Family

ID=77523042

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/092291 WO2022236453A1 (zh) 2021-05-08 2021-05-08 一种声纹识别方法、歌手认证方法、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN113366567A (zh)
WO (1) WO2022236453A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140348308A1 (en) * 2013-05-22 2014-11-27 Nuance Communications, Inc. Method And System For Speaker Verification
CN109257362A (zh) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 声纹验证的方法、装置、计算机设备以及存储介质
CN109684454A (zh) * 2018-12-26 2019-04-26 北京壹捌零数字技术有限公司 一种社交网络用户影响力计算方法及装置
CN110010159A (zh) * 2019-04-02 2019-07-12 广州酷狗计算机科技有限公司 声音相似度确定方法及装置
US10665244B1 (en) * 2018-03-22 2020-05-26 Pindrop Security, Inc. Leveraging multiple audio channels for authentication
CN111554303A (zh) * 2020-05-09 2020-08-18 福建星网视易信息系统有限公司 一种歌曲演唱过程中的用户身份识别方法及存储介质

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1808567A (zh) * 2006-01-26 2006-07-26 覃文华 验证真人在场状态的声纹认证设备和其认证方法
CN102404278A (zh) * 2010-09-08 2012-04-04 盛乐信息技术(上海)有限公司 一种基于声纹识别的点歌系统及其应用方法
CN103841108B (zh) * 2014-03-12 2018-04-27 北京天诚盛业科技有限公司 用户生物特征的认证方法和系统
CN105989842B (zh) * 2015-01-30 2019-10-25 福建星网视易信息系统有限公司 对比声纹相似度的方法、装置及其在数字娱乐点播系统中的应用
CN105656887A (zh) * 2015-12-30 2016-06-08 百度在线网络技术(北京)有限公司 基于人工智能的声纹认证方法以及装置
ES2912165T3 (es) * 2018-07-06 2022-05-24 Veridas Digital Authentication Solutions S L Autenticación de un usuario
CN111199729B (zh) * 2018-11-19 2023-09-26 阿里巴巴集团控股有限公司 声纹识别方法及装置
CN109243465A (zh) * 2018-12-06 2019-01-18 平安科技(深圳)有限公司 声纹认证方法、装置、计算机设备以及存储介质
CN109448725A (zh) * 2019-01-11 2019-03-08 百度在线网络技术(北京)有限公司 一种语音交互设备唤醒方法、装置、设备及存储介质
CN111444377A (zh) * 2020-04-15 2020-07-24 厦门快商通科技股份有限公司 一种声纹识别的认证方法和装置以及设备
CN112331217B (zh) * 2020-11-02 2023-09-12 泰康保险集团股份有限公司 声纹识别方法和装置、存储介质、电子设备
CN112614478B (zh) * 2020-11-24 2021-08-24 北京百度网讯科技有限公司 音频训练数据处理方法、装置、设备以及存储介质
CN112231510B (zh) * 2020-12-17 2021-03-16 北京远鉴信息技术有限公司 声纹存储方法、声纹查询方法、服务器及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140348308A1 (en) * 2013-05-22 2014-11-27 Nuance Communications, Inc. Method And System For Speaker Verification
US10665244B1 (en) * 2018-03-22 2020-05-26 Pindrop Security, Inc. Leveraging multiple audio channels for authentication
CN109257362A (zh) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 声纹验证的方法、装置、计算机设备以及存储介质
CN109684454A (zh) * 2018-12-26 2019-04-26 北京壹捌零数字技术有限公司 一种社交网络用户影响力计算方法及装置
CN110010159A (zh) * 2019-04-02 2019-07-12 广州酷狗计算机科技有限公司 声音相似度确定方法及装置
CN111554303A (zh) * 2020-05-09 2020-08-18 福建星网视易信息系统有限公司 一种歌曲演唱过程中的用户身份识别方法及存储介质

Also Published As

Publication number Publication date
CN113366567A (zh) 2021-09-07

Similar Documents

Publication Publication Date Title
US10497378B2 (en) Systems and methods for recognizing sound and music signals in high noise and distortion
WO2017113658A1 (zh) 基于人工智能的声纹认证方法以及装置
CN108897867A (zh) 用于知识问答的数据处理方法、装置、服务器和介质
JP6785904B2 (ja) 情報プッシュ方法及び装置
WO2021114841A1 (zh) 一种用户报告的生成方法及终端设备
WO2021047319A1 (zh) 基于语音的个人信用评估方法、装置、终端及存储介质
WO2022178969A1 (zh) 语音对话数据处理方法、装置、计算机设备及存储介质
CN107293307A (zh) 音频检测方法及装置
WO2021051681A1 (zh) 一种歌曲识别方法、装置、存储介质及电子设备
CN105679324A (zh) 一种声纹识别相似度评分的方法和装置
JP4143541B2 (ja) 動作モデルを使用して非煩雑的に話者を検証するための方法及びシステム
CN111737515B (zh) 音频指纹提取方法、装置、计算机设备和可读存储介质
WO2022236453A1 (zh) 一种声纹识别方法、歌手认证方法、电子设备及存储介质
US9384758B2 (en) Derivation of probabilistic score for audio sequence alignment
JP6996627B2 (ja) 情報処理装置、制御方法、及びプログラム
CN110489588B (zh) 音频检测方法、装置、服务器及存储介质
CN115083397A (zh) 歌词声学模型的训练方法、歌词识别方法、设备和产品
Wang et al. Speech emotion recognition using multiple classifiers
KR102530059B1 (ko) 메타버스 공간에서 구현되는 경연 콘텐츠의 아바타와 연계하여 제공되는 nft 기반의 서비스 제공 방법 및 장치
JP7287442B2 (ja) 情報処理装置、制御方法、及びプログラム
Wu et al. A Fingerprint and Voiceprint Fusion Identity Authentication Method
CN116631436A (zh) 性别识别模型处理方法、装置、计算机设备及存储介质
KR20240042796A (ko) 음성 기반 스트레스 판별 방법 및 장치
CN115658957A (zh) 基于模糊聚类算法的音乐旋律轮廓提取方法及装置
CN115034904A (zh) 交易准入审核方法、装置、设备、介质和程序产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21941037

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE