WO2022236827A1 - Voiceprint management method and apparatus - Google Patents

Voiceprint management method and apparatus

Info

Publication number
WO2022236827A1
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
time window
voice signal
text
user
Prior art date
Application number
PCT/CN2021/093917
Other languages
French (fr)
Chinese (zh)
Inventor
张嘉祺 (Zhang Jiaqi)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202180041437.8A (published as CN115699168A)
Priority to PCT/CN2021/093917 (published as WO2022236827A1)
Publication of WO2022236827A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques

Definitions

  • the present application relates to the technical field of speech recognition, in particular to a voiceprint management method and device.
  • voiceprint recognition technology has been widely used in many scenarios, such as vehicle network scenarios, smart home scenarios, and business processing scenarios.
  • voiceprint recognition technology compares a voiceprint stored in the voiceprint management device with a collected voiceprint to determine the identity information of the user.
  • the user's reference voiceprint (that is, the voiceprint registered by the user in the voiceprint management device) is pre-stored in the voiceprint management device.
  • in voiceprint recognition, when the voiceprint management device acquires the user's collected voiceprint, it can compare the collected voiceprint with the reference voiceprint to determine whether the two correspond to the same user.
  • the user's voiceprint may change with time.
  • the present application provides a voiceprint management method and device, which are used to timely and accurately update a reference voiceprint in a voiceprint management device, thereby improving recognition accuracy in voiceprint recognition.
  • the present application provides a voiceprint management method, which can be executed by a terminal device, such as a vehicle-mounted device (for example, an in-vehicle head unit or a vehicle-mounted computer) or a user terminal (for example, a mobile phone or a computer).
  • the voiceprint management method can also be implemented by components of the terminal device, such as processing devices, circuits, or chips in the terminal device, for example, a system chip or a processing chip.
  • the system chip is also called a system on chip (SoC) chip.
  • the method may also be executed by a server, and the server may include physical devices such as hosts or processors, or virtual devices such as virtual machines or containers, and may also include chips or integrated circuits.
  • the server is, for example, an Internet of Vehicles (IoV) server, also known as a cloud server, the cloud, or a cloud controller.
  • the IoV server may be a single server or a server cluster composed of multiple servers; this is not specifically limited.
  • the method may also be executed by components of the server, for example, implemented by components such as processing devices, circuits, and chips in the server.
  • the voiceprint management method provided by the present application includes: acquiring the verification result of the first voiceprint in the first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window.
  • the first voiceprint is determined according to the first voice signal of the first user acquired in the first time window.
  • the average similarity of the second time window is used to indicate the similarity between the voiceprints of the first user acquired in the second time window and the reference voiceprint, and the starting point of the first time window is later than the starting point of the second time window; the reference voiceprint is updated according to the verification result of the first voiceprint in the first time window.
  • the average similarity of the second time window is predetermined, and it is used together with the reference voiceprint as a reference parameter for deciding whether to update the reference voiceprint; this avoids inaccurate judgments caused by an inaccurate single registration voiceprint recorded when the first user registered. Further, the similarities between the multiple first voiceprints in the first time window and the reference voiceprint are determined and combined with the average similarity of the second time window to determine whether the first user's voiceprint has undergone a permanent change. When it has, the first user's reference voiceprint is updated; when it has not (for example, only a short-term change occurred), the reference voiceprint is not updated. This helps improve the accuracy of the first user's reference voiceprint, and thus the recognition accuracy and robustness of voiceprint recognition.
  • the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window as follows: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the first voiceprint passes verification; and/or, when the similarity between the first voiceprint and the reference voiceprint is less than or equal to the average similarity of the second time window, the first voiceprint fails verification.
  • using the reference voiceprint together with the average similarity of the second time window as the reference parameter for judging whether the first voiceprint passes verification avoids the problem of an inaccurate registered voiceprint caused by accidental factors when the first user registered, and helps improve the accuracy of the verification result of the first voiceprint.
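The pass/fail rule described above can be sketched as follows. This is an illustrative Python sketch only: the function names, the choice of cosine similarity as the similarity measure, and the example vectors are assumptions for illustration, not details taken from the patent.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two voiceprint feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify_voiceprint(first_vp, reference_vp, window_avg_similarity):
    # The first voiceprint passes verification only when its similarity
    # to the reference voiceprint exceeds the average similarity of the
    # second time window.
    return cosine_similarity(first_vp, reference_vp) > window_avg_similarity
```

Using the second-window average as the threshold, rather than a fixed constant, ties the decision to how similar the user's own recent voiceprints normally are to the reference.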
  • updating the reference voiceprint according to the verification result of the first voiceprint in the first time window includes: updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window (that is, the compliance rate of the first time window), where this ratio is the number of first voiceprints that pass verification in the first time window divided by the total number of first voiceprints in the first time window.
  • updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window includes: updating the reference voiceprint when that ratio is less than or equal to a ratio threshold.
  • the verification results of multiple first voiceprints can be obtained in the first time window, and the ratio of first voiceprints that pass verification can then be determined from those verification results. This ratio can indicate whether the first user's voiceprint has undergone a permanent change, so the first user's reference voiceprint is updated or left unchanged according to it. This avoids mistakenly updating the first user's reference voiceprint because of the contingency of a single first voiceprint's verification result.
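The ratio-based update decision can be sketched as follows; the function name and the default threshold value are hypothetical, and the patent itself does not fix a particular threshold.

```python
def should_update_reference(verification_results, ratio_threshold=0.5):
    """Decide whether to update the reference voiceprint from the boolean
    verification results gathered in the first time window: update when
    the pass ratio (compliance rate) is at or below the ratio threshold,
    which suggests a lasting change in the user's voiceprint."""
    if not verification_results:
        return False
    pass_ratio = sum(verification_results) / len(verification_results)
    return pass_ratio <= ratio_threshold
```

Basing the decision on a window of results, rather than on one failed verification, is what protects against accidental misfires (a cold, background noise, and so on).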
  • updating the reference voiceprint includes: acquiring a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and updating the reference voiceprint according to the second voice signal.
  • that the similarity between the acquired second voice signal and the reference voiceprint meets the preset condition can be understood as follows: during the period from the starting point of the first time window to the time at which the second voice signal is acquired, the change in the first user's voiceprint is less than a change threshold (in other words, the voiceprint is in a stable state or has not changed for a long time), so the reference voiceprint can be updated according to the second voice signal.
  • updating the reference voiceprint according to the second voice signal includes: obtaining a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive; and updating the reference voiceprint according to the deduplicated voice signal.
  • obtaining the deduplicated voice signal according to the second voice signal and its corresponding text includes: performing a deduplication operation on the text corresponding to the second voice signal to obtain the deduplicated text; and obtaining the deduplicated voice signal according to the portion of the second voice signal that corresponds to the deduplicated text.
  • obtaining the deduplicated voice signal according to the second voice signal and its corresponding text includes, for the i-th second voice signal among multiple second voice signals: performing a deduplication operation on the text corresponding to the i-th second voice signal according to the historical deduplicated text; after the deduplication operation, splicing the text corresponding to the i-th second voice signal with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the 1st to i-th second voice signals, where the historical deduplicated text is obtained according to the texts corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtaining the deduplicated voice signal according to the portions of the multiple second voice signals that correspond to the deduplicated text.
  • the second voice signal is deduplicated to obtain the deduplicated voice signal, preventing high-frequency characters or words that appear many times in the signal from affecting the accuracy of the extracted reference voiceprint.
  • the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.
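The incremental text deduplication and the sufficiency check above can be sketched as follows. This is an assumed character-level implementation (natural for Chinese text); the function names and threshold values are illustrative, not from the patent.

```python
def deduplicate_text(new_text, history=""):
    """Character-level deduplication: append to `history` only the
    characters not seen before, yielding non-repetitive text that grows
    as successive second voice signals arrive."""
    seen = set(history)
    out = list(history)
    for ch in new_text:
        if ch not in seen:
            seen.add(ch)
            out.append(ch)
    return "".join(out)

def enough_material(dedup_text, dedup_duration_s,
                    min_chars=10, min_duration_s=3.0):
    # Update the reference voiceprint only when the deduplicated speech
    # is long enough and its text is varied enough.
    return dedup_duration_s > min_duration_s and len(dedup_text) > min_chars
```

The corresponding audio segments for the surviving characters would then be concatenated to form the deduplicated voice signal used for re-enrollment.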
  • the method further includes: sliding the first time window, wherein an end point of the first time window after sliding is later than an end point of the first time window before sliding.
  • the first time window can be slid, the verification results of the first voiceprints in the slid first time window can be obtained, and the ratio of first voiceprints that pass verification in the slid window (that is, the compliance rate of the slid first time window) can be determined from those results. The end point of the slid first time window is later than the end point of the window before sliding, and the interval between the two is shorter than the length of the first time window.
  • the compliance rate of the first time window is calculated dynamically: the compliance rate over the time span (that is, the first time window) is determined at regular intervals, so that a permanent change in the first user's voiceprint can be discovered in time and the first user's registered voiceprint updated.
  • the length of the first time window is variable.
  • a first count threshold can be set; when the number of verification results of first voiceprints in the first time window is less than the first count threshold, the duration of the first time window can be automatically extended until the number of verification results of first voiceprints reaches the first count threshold. Basing the judgment on verification results that reach the first count threshold helps improve the accuracy of identifying a permanent change in the first user's voiceprint.
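The variable-length, sliding first time window can be sketched as a small bookkeeping class; the class name, counts, and step size are illustrative assumptions.

```python
class FirstTimeWindow:
    """Illustrative sliding window over verification results: the window
    extends until it holds at least `min_results` results, the compliance
    rate is then read out, and the window slides forward by `step`."""

    def __init__(self, min_results=20, step=5):
        self.min_results = min_results
        self.step = step
        self.results = []

    def add(self, passed):
        self.results.append(bool(passed))

    def ready(self):
        # The window's duration is effectively extended until enough
        # verification results have accumulated.
        return len(self.results) >= self.min_results

    def compliance_rate(self):
        return sum(self.results) / len(self.results)

    def slide(self):
        # Slide the window: its end point moves later; the oldest results
        # fall out while the remainder still overlaps the previous window.
        self.results = self.results[self.step:]
```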
  • it also includes: sliding the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the second time window after sliding The starting point of the window is earlier than the starting point of the first time window after sliding.
  • in a possible implementation, when the reference voiceprint is updated, the second time window is slid, and the starting point of the slid second time window is no earlier than the time point at which the reference voiceprint is updated.
  • in another possible implementation, when the reference voiceprint is updated, the second time window is slid, and the starting point of the slid second time window is no earlier than the time point at which the second voice signal is acquired.
  • when the reference voiceprint is updated, the second time window can be slid; second voiceprints in the slid second time window are then acquired, the similarity between each second voiceprint and the reference voiceprint is determined, and the average similarity of the slid second time window is obtained. Updating the average similarity of the second time window promptly after updating the reference voiceprint helps to subsequently determine whether the first user's voiceprint undergoes a permanent change again.
  • the length of the second time window is variable.
  • a second count threshold may be set; when the number of similarities between second voiceprints and the reference voiceprint determined in the second time window is less than the second count threshold, the duration of the second time window may be automatically extended until that number reaches the second count threshold. Basing the average on a number of similarities that reaches the second count threshold helps improve the accuracy of the average similarity of the second time window.
  • before the verification result of the first voiceprint in the first time window is obtained, the method further includes: determining the similarities in the second time window according to the reference voiceprint and the second voiceprints of the first user acquired in the second time window; and determining the average similarity of the second time window according to the multiple similarities in the second time window.
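Computing the second-window average similarity can be sketched as below; the function name and the pluggable `similarity_fn` are illustrative assumptions.

```python
def window_average_similarity(second_voiceprints, reference_vp, similarity_fn):
    """Average similarity between the second voiceprints collected in the
    second time window and the reference voiceprint. Only this scalar
    needs to be stored afterwards, not the individual voiceprints."""
    sims = [similarity_fn(vp, reference_vp) for vp in second_voiceprints]
    return sum(sims) / len(sims)
```

Keeping only the scalar average is what enables the storage savings the application mentions: the second voiceprints themselves can be discarded once the average is known.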
  • the present application provides a voiceprint management device, which may be a terminal device, or a component (such as a processing device, circuit, chip, etc.) in the terminal device.
  • the terminal device includes vehicle-mounted devices (such as an in-vehicle head unit or a vehicle-mounted computer) and user terminals (such as a mobile phone or a computer).
  • the device may also be a server, or a component (such as a processing device, a circuit, a chip, etc.) in a server.
  • the server may include physical devices such as hosts or processors, virtual devices such as virtual machines or containers, and may also include chips or integrated circuits.
  • the server is, for example, an Internet of Vehicles (IoV) server, also known as a cloud server, the cloud, or a cloud controller.
  • the IoV server may be a single server or a server cluster composed of multiple servers; this is not specifically limited.
  • the voiceprint management device includes an acquisition module and a processing module. The acquisition module is configured to acquire the verification result of the first voiceprint in the first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window; the first voiceprint is determined according to the first voice signal of the first user acquired in the first time window; the average similarity of the second time window is used to indicate the similarity between the voiceprints of the first user acquired in the second time window and the reference voiceprint; and the starting point of the first time window is later than the starting point of the second time window. The processing module is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  • the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window as follows: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the first voiceprint passes verification.
  • the processing module is specifically configured to: update the reference voiceprint according to the ratio of the verified first voiceprint in the first time window.
  • the processing module is specifically configured to: update the reference voiceprint when the ratio of the verified first voiceprint in the first time window is less than or equal to a ratio threshold.
  • the processing module is specifically configured to: control the acquisition module to acquire the second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and update the reference voiceprint according to the second voice signal.
  • the processing module is specifically configured to: obtain the deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive; and update the reference voiceprint according to the deduplicated voice signal.
  • the processing module is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain the deduplicated text; and obtain the deduplicated voice signal according to the portion of the second voice signal that corresponds to the deduplicated text.
  • the processing module is specifically configured to: for the i-th second voice signal among the multiple second voice signals, perform a deduplication operation on the text corresponding to the i-th second voice signal according to the historical deduplicated text; after the deduplication operation, splice the text corresponding to the i-th second voice signal with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the 1st to i-th second voice signals, where the historical deduplicated text is obtained according to the texts corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtain the deduplicated voice signal according to the portions of the multiple second voice signals that correspond to the deduplicated text.
  • the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.
  • the processing module is further configured to: slide the first time window, wherein an end point of the first time window after sliding is later than an end point of the first time window before sliding.
  • the length of the first time window is variable.
  • the processing module is further configured to: slide the second time window, where the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
  • the processing module is further configured to: determine the similarities in the second time window according to the reference voiceprint and the second voiceprints of the first user acquired in the second time window; and determine the average similarity of the second time window according to the multiple similarities in the second time window.
  • the present application provides a computer program product, including a computer program or instructions; when the computer program or instructions are executed by a computing device, the method in the above first aspect or any possible implementation of the first aspect is realized.
  • the present application provides a computer-readable storage medium, in which a computer program or instructions are stored; when the computer program or instructions are executed by a computing device, the method in the above first aspect or any possible implementation of the first aspect is realized.
  • the present application provides a computing device, including a processor connected to a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the computing device performs the method in the above first aspect or any possible implementation of the first aspect.
  • an embodiment of the present application provides a chip system, including a processor coupled to a memory; the memory is used to store programs or instructions, and when the programs or instructions are executed by the processor, the chip system realizes the method in the above first aspect or any possible implementation of the first aspect.
  • the chip system further includes an interface circuit, configured to transfer code instructions to the processor.
  • there may be one or more processors in the chip system, and a processor may be implemented by hardware or by software.
  • the processor may be a logic circuit, an integrated circuit, or the like.
  • the processor may be a general-purpose processor, implemented by reading software codes stored in memory.
  • the memory can be integrated with the processor, or can be set separately from the processor.
  • the memory may be a non-transitory memory, such as a read-only memory (ROM); it may be integrated with the processor on the same chip, or the two may be provided on different chips.
  • the reference voiceprint of the first user is determined in advance; second voiceprints of the first user are then acquired in the second time window, the similarity between each second voiceprint and the reference voiceprint is determined, and the average similarity of the second time window is determined according to the multiple similarities in the second time window.
  • the second time window follows the time point at which the reference voiceprint is stored; that is, during the period from the time point at which the user registers the reference voiceprint to the end point of the second time window, the user's voiceprint changes steadily, and a reference parameter (the reference voiceprint together with the average similarity of the second time window) can be obtained based on the steadily changing voiceprint of the first user.
  • the similarity between the first voiceprint and the reference voiceprint can then be determined based on the user's first voiceprint in the first time window and compared with the average similarity of the second time window to obtain the verification result of the first voiceprint, that is, to determine whether the first voiceprint in the first time window is similar to the second voiceprints in the second time window, and thus whether the user's voiceprint has undergone a long-term change during the period from the starting point of the second time window to the end point of the first time window.
  • the verification results of multiple first voiceprints in the first time window are determined, and the ratio of first voiceprints that pass verification in the first time window (that is, the compliance rate of the first time window) is obtained. Judging from this compliance rate whether the user's voiceprint has undergone a long-term change avoids misjudgments caused by accidental factors within the first time window and helps improve the accuracy of the judgment.
  • the first time window can be slid to determine the ratio of the verified first voiceprint in the first time window after sliding (that is, the compliance rate in the first time window after sliding) , according to the compliance rate in the first time window after sliding, it is determined whether the voiceprint of the user has undergone a permanent change, which is helpful to timely discover the permanent change of the voiceprint of the first user and update the registered voiceprint of the first user.
  • the second voice signal is acquired and the first user's voiceprint is determined from it without any interaction with the user, so the process is imperceptible to the user, which helps improve the user experience.
  • there is no need to pre-store the multiple first voiceprints of the first time window, which helps reduce the storage overhead.
  • there is no need to store the multiple second voiceprints of the second time window either; only the average similarity of the second time window needs to be saved, further reducing the storage overhead.
  • Fig. 1a is a schematic structural diagram of a speech signal processing system provided by the present application.
  • Fig. 1b is a schematic flow diagram of a semantic understanding process exemplarily provided by the present application.
  • Fig. 1c is a schematic flow chart of a voiceprint extraction process exemplarily provided by the present application.
  • Fig. 2 is a schematic diagram of a vehicle-mounted scene provided by the present application.
  • Fig. 3 is a schematic diagram of another vehicle-mounted scene provided by the present application.
  • Fig. 4 is a schematic display diagram of a mobile phone interface exemplarily provided by the present application.
  • Fig. 5 is a schematic diagram of the correspondence between a voiceprint management process and time provided by the present application.
  • Fig. 6 is a schematic flow diagram of a voiceprint verification process provided by the present application.
  • Fig. 7 is a schematic diagram of the correspondence between another voiceprint management process and time provided by the present application.
  • Fig. 8 is a schematic flow chart of updating a reference voiceprint provided by the present application.
  • Fig. 9 is a schematic diagram of yet another vehicle-mounted scene exemplarily provided by the present application.
  • Fig. 10 is a schematic flow chart of another voiceprint management method exemplarily provided by the present application.
  • Fig. 11 is a schematic structural diagram of a voiceprint management device exemplarily provided by the present application.
  • Fig. 12 is a schematic structural diagram of another voiceprint management device exemplarily provided by the present application.
  • Fig. 1a is a schematic structural diagram of a voice signal processing system exemplarily provided in the present application, the voice signal processing system may at least include a voice collection device, a voiceprint management device and a semantic understanding device.
  • the voice signal processing system can collect the user's voice signal through the voice collection device and input it into the semantic understanding device and the voiceprint management device respectively. The semantic understanding device can perform a semantic extraction process on the voice signal to obtain the machine-recognizable instruction corresponding to it; the voiceprint management device can determine the user's voiceprint feature vector from the voice signal and perform the corresponding recognition process based on that vector.
  • the voice collection device may be a microphone array, which may be composed of a certain number of acoustic sensors (usually microphones).
  • the microphone array can provide one or more of the following functions: speech enhancement, extracting clean speech from a noisy speech signal; sound source localization, using the array to calculate the angle and distance of the target speaker so as to track the speaker and subsequently pick up voice directionally; de-reverberation, reducing the impact of reflected sound; and sound source extraction/separation, separating the individual source signals from mixed sound.
  • Microphone arrays can be applied in complex environments with multiple sound sources, noise, and echoes, such as in vehicles, outdoors, and in supermarkets.
  • the semantic understanding device may sequentially perform the following processing on the speech signal:
  • ASR: automatic speech recognition.
  • the voice signal may be processed as a sound wave.
  • the voice signal is processed in frames to obtain a small segment of waveform corresponding to each frame.
  • the small segment of waveform is transformed into multi-dimensional vector information according to human ear characteristics, wherein the duration of each frame may be about 20ms to 30ms.
  • multiple phonemes (phones) corresponding to the multi-dimensional vector information are decoded, and the phonemes are assembled into words and concatenated into sentences (that is, natural language text) for output.
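The framing step described above (cutting the waveform into roughly 20-30 ms pieces before feature extraction) can be sketched as follows; the function name and the hop length are illustrative assumptions.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into short overlapping frames (each roughly
    20-30 ms), as done before acoustic feature extraction in ASR."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]
```

Each frame would then be converted into a multi-dimensional feature vector (for example MFCCs) for decoding.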
  • NLP Natural language processing
• Semantic slot filling: the structured information obtained from natural language processing is filled into the corresponding slots, so that user intentions can be converted into machine-recognizable user instructions.
• the voiceprint management device may include a voiceprint extraction module, and the voiceprint extraction module may be used to perform voiceprint extraction on voice signals to obtain the corresponding voiceprint feature vector.
  • the voiceprint extraction module may perform pre-processing (also called pre-processing), voiceprint feature extraction and post-processing on the voice signal in sequence.
• Preprocessing: extract the audio feature information from the speech signal.
• one or more of the following operations may be performed on the speech signal: denoising, voice activity detection (VAD), perceptual linear predictive (PLP) analysis, and Mel-frequency cepstral coefficient (MFCC) calculation.
• Voiceprint feature extraction: the audio feature information is input into the voiceprint feature extraction model, which outputs a voiceprint feature vector.
• Voiceprint feature extraction models include but are not limited to one or more of: the Gaussian mixture model (GMM), the joint factor analysis (JFA) model, the i-vector model, the d-vector model, and the x-vector model.
  • the voiceprint feature vector may be referred to as voiceprint for short.
• Post-processing: perform post-processing on the voiceprint output by the voiceprint feature extraction model to obtain the final voiceprint.
• the post-processing may include one or more of the following: linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), and nuisance attribute projection (NAP).
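The three-stage extraction pipeline described above (preprocessing, voiceprint feature extraction, post-processing) can be sketched as follows. This is a toy illustration, not any of the models named above: a crude energy-based VAD stands in for the preprocessing operations, simple per-frame statistics stand in for a trained GMM/i-vector/x-vector model, and length normalization stands in for LDA/PLDA/NAP.

```python
import math

def preprocess(signal, vad_threshold=0.01, frame_len=4):
    """Toy pre-processing: keep only frames whose average energy exceeds
    a threshold (a stand-in for denoising/VAD/PLP/MFCC)."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    return [f for f in frames if sum(x * x for x in f) / len(f) > vad_threshold]

def extract_embedding(frames, dim=3):
    """Toy feature extractor: average per-frame statistics into a fixed-length
    vector (a stand-in for a trained voiceprint feature extraction model)."""
    emb = [0.0] * dim
    for f in frames:
        stats = [sum(f) / len(f), max(f), min(f)]
        emb = [e + s for e, s in zip(emb, stats)]
    n = max(len(frames), 1)
    return [e / n for e in emb]

def postprocess(emb):
    """Toy post-processing: length normalization (a stand-in for LDA/PLDA/NAP)."""
    norm = math.sqrt(sum(x * x for x in emb)) or 1.0
    return [x / norm for x in emb]

# Hypothetical amplitude samples: one silent frame followed by voiced frames.
signal = [0.0, 0.0, 0.0, 0.0, 0.3, 0.5, 0.4, 0.2, 0.35, 0.45, 0.5, 0.1]
voiceprint = postprocess(extract_embedding(preprocess(signal)))
```

The silent frame is dropped by the VAD stage, and the final voiceprint is a unit-length vector, mirroring the common practice of comparing normalized embeddings by cosine similarity.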
  • the corresponding voiceprint can be extracted from the voice signal.
• the voiceprint, like the face, fingerprint, and iris, is a kind of biometric information. According to the voiceprint, the identity of the user who initiated the voice signal can be determined. Compared with face recognition, identifying a user's identity by voiceprint is not restricted by facial occlusion; compared with fingerprint recognition, it is not restricted by physical contact, making it easier to implement.
  • the voiceprint management device may also include a voiceprint recognition module, which can be used to identify the user's identity information according to the voiceprint.
• the voiceprint recognition module pre-stores the user's reference voiceprint, and the voiceprint recognition module can compare the voiceprint from the voiceprint extraction module (which may be called the collected voiceprint) with the reference voiceprint to determine whether the collected voiceprint and the reference voiceprint correspond to the same user.
  • the voiceprint recognition module may determine whether the collected voiceprint and the reference voiceprint correspond to the same user according to the similarity threshold.
  • the voiceprint recognition module may take the similarity between registered voiceprints and collected voiceprints of N users as samples, and determine the similarity threshold based on the samples, where N is a positive integer.
• the voiceprint recognition module can obtain N similarities, one between the registered voiceprint and the collected voiceprint of each of the N users, and then determine the similarity threshold according to the N similarities.
  • the voiceprint recognition module may use the average value or median value of N similarities as the similarity threshold.
  • the similarity threshold may be 0.75.
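As a sketch of deriving the similarity threshold from N samples, using hypothetical per-user similarity values (whose mean happens to reproduce the 0.75 example above):

```python
from statistics import mean, median

# Hypothetical similarities between each of N = 5 users' registered
# voiceprints and collected voiceprints of the same users.
same_user_similarities = [0.82, 0.74, 0.71, 0.78, 0.70]

# Either the average value or the median value can serve as the threshold.
threshold_mean = mean(same_user_similarities)      # 0.75
threshold_median = median(same_user_similarities)  # 0.74
```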
• If the voiceprint recognition module determines that the similarity between the collected voiceprint and the reference voiceprint is greater than the similarity threshold, it can determine that the collected voiceprint and the reference voiceprint correspond to the same user. If it determines that the similarity is less than or equal to the similarity threshold, it can determine that the collected voiceprint and the reference voiceprint correspond to different users.
• A similarity between the collected voiceprint and the reference voiceprint greater than the similarity threshold can be understood as the collected voiceprint matching the reference voiceprint; a similarity less than or equal to the similarity threshold can be understood as the two not matching.
• Alternatively, when the similarity between the collected voiceprint and the reference voiceprint is greater than or equal to the similarity threshold, the voiceprint recognition module may determine that the collected voiceprint and the reference voiceprint correspond to the same user; when the similarity is less than the similarity threshold, it determines that they correspond to different users. In this case, a similarity greater than or equal to the similarity threshold can be understood as the collected voiceprint matching the reference voiceprint, and a similarity less than the similarity threshold as the two not matching.
• Taking the first example above as an example, when multiple reference voiceprints are stored, the reference voiceprint with the greatest similarity to the collected voiceprint among the multiple reference voiceprints can be used as the matching voiceprint of the collected voiceprint; that is, among the multiple reference voiceprints, the one with the greatest similarity to the collected voiceprint is determined to match it, while the other reference voiceprints do not match the collected voiceprint.
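The matching rule over multiple reference voiceprints can be sketched as follows, using cosine similarity (one of the similarity algorithms mentioned later in this document) on hypothetical 3-dimensional voiceprint vectors; the user ids and the 0.75 threshold are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(collected, reference_prints, threshold=0.75):
    """Return the id of the reference voiceprint with the greatest similarity
    to the collected voiceprint, or None if even the best falls at or below
    the similarity threshold (no match)."""
    best_id, best_sim = None, -1.0
    for user_id, ref in reference_prints.items():
        sim = cosine_similarity(collected, ref)
        if sim > best_sim:
            best_id, best_sim = user_id, sim
    return best_id if best_sim > threshold else None

refs = {"user_a": [1.0, 0.1, 0.0], "user_b": [0.0, 1.0, 0.2]}
print(identify([0.9, 0.2, 0.05], refs))  # user_a: greatest similarity, above threshold
```

A collected voiceprint dissimilar to every stored reference voiceprint (e.g. `[0.0, 0.0, 1.0]`) returns `None`, i.e. no reference voiceprint matches.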
  • Voiceprint recognition can be used in the identification process.
• reference voiceprints of one or more users are stored in the voiceprint recognition module. The voiceprint recognition module can compare the collected voiceprint with the stored one or more reference voiceprints, determine from them the reference voiceprint matching the currently collected voiceprint, and then determine the user's identity information according to the determined reference voiceprint.
  • different permissions corresponding to different users may be stored in the voiceprint recognition module.
• After the voiceprint recognition module determines the user's identity information according to the collected voiceprint, it can further determine the user's corresponding authority according to the user's identity information.
  • Voiceprint recognition can also be used in the identity verification process.
  • the reference voiceprint of one or more users who have passed the identity verification is stored in the voiceprint recognition module.
• the voiceprint recognition module can compare the collected voiceprint with the stored one or more reference voiceprints to determine whether there is a reference voiceprint matching the collected voiceprint. If so, it can be determined that the user corresponding to the currently collected voiceprint passes the identity verification; otherwise, it can be determined that the user corresponding to the currently collected voiceprint fails the identity verification.
  • the voiceprint management device in this usage scenario may be a vehicle-mounted device (for example, a car machine, a vehicle-mounted computer, etc.), and the vehicle-mounted device may determine the identity information of the user corresponding to the currently collected voiceprint based on the reference voiceprint.
• the vehicle-mounted device can determine whether the user has passed the identity verification based on the reference voiceprint. Specifically, suppose the vehicle-mounted device provides the query function for "vehicle violation information" only to the vehicle owner. The vehicle-mounted device can store the reference voiceprint a of user A (the vehicle owner) and mark that reference voiceprint a corresponds to the vehicle owner. With reference to the scene shown in Fig. 2(a), when user A wants to query vehicle violation information, user A can say "inquire about vehicle violation information"; at this moment, the vehicle-mounted device can extract the voiceprint in the voice signal "inquire about vehicle violation information", determine that it matches reference voiceprint a, and provide the query result.
• When user B says "query vehicle violation information", the vehicle-mounted device can extract the voiceprint in user B's voice signal, determine that the extracted voiceprint does not match reference voiceprint a, and prompt user B that the query failed, for example by displaying "Only the vehicle owner may query" on the display interface.
  • the vehicle-mounted device can determine different permissions corresponding to different users based on different user identity information. Specifically, car owners have the right to query vehicle violation information, while non-car owners do not have the right to query vehicle violation information.
• When the vehicle-mounted device determines that the user is the vehicle owner, it provides the user with the function of querying vehicle violation information; when it determines that the user is not the vehicle owner, it refuses to provide this function.
• When the vehicle-mounted device determines different permissions for different users according to their identity information, a reference voiceprint corresponding to the driver can also be set in the vehicle-mounted device.
• If the collected voiceprint matches the driver's reference voiceprint, the vehicle-mounted device can determine that the current user is the driver and, correspondingly, provide the current user with the authority corresponding to the driver, for example, controlling the driving of the vehicle through voice signals.
  • the in-vehicle device provides different recommended content for different users.
  • the vehicle-mounted device can store the reference voiceprint a of user A, and record that user A likes rock music; and store the reference voiceprint b of user B, and record that user B likes light music.
• When user A says "turn on the music", the vehicle-mounted device can extract the voiceprint in the voice signal, compare the extracted voiceprint with the stored reference voiceprint a and reference voiceprint b, determine that the extracted voiceprint matches reference voiceprint a, and recommend a rock music list on the display interface.
• Similarly, when user B says "turn on the music", the vehicle-mounted device can extract the voiceprint in the voice signal, compare the extracted voiceprint with the stored reference voiceprint a and reference voiceprint b, determine that the extracted voiceprint matches reference voiceprint b, and recommend a light music list on the display interface.
• When the vehicle-mounted device determines that the extracted voiceprint matches reference voiceprint a, it can also directly play rock music; or, when it determines that the extracted voiceprint matches reference voiceprint b, it can also directly play light music.
  • other implementation manners may also be used, which are not limited in this application.
  • the voiceprint management device may be a user terminal, such as a mobile phone.
• the reference voiceprint of the phone owner is pre-stored in the mobile phone, and the mobile phone can determine whether the collected voiceprint matches the reference voiceprint, and thus whether the user corresponding to the currently collected voiceprint is the owner.
  • the owner user can instruct the mobile phone to unlock and perform corresponding actions through voice signals.
• the voiceprint management device needs to compare the collected voiceprint against a relatively accurate reference voiceprint and then perform corresponding actions based on the comparison result (match or no match). That is, the accuracy of the reference voiceprint stored in the voiceprint management device affects the accuracy of voiceprint recognition.
  • the user's voiceprint may undergo short-term or long-term changes.
  • Short-term changes refer to reversible changes in the user's voiceprint due to temporary external stimuli, such as reversible changes in the user's voiceprint caused by a cold.
• Long-term changes refer to irreversible changes in the user's voiceprint caused by the user's physiological changes.
• the voiceprint management device needs to update the user's reference voiceprint based on a voiceprint that has undergone a long-term change, rather than based on a voiceprint that has undergone only a short-term change, so as to improve the accuracy of the reference voiceprint and thus the accuracy of voiceprint recognition.
  • the user's reference voiceprint is updated.
  • the present application exemplarily provides a voiceprint management method, which can be executed by a voiceprint management device.
  • the voiceprint management apparatus may be the voiceprint management device exemplarily shown in FIG. 1a.
  • the voiceprint management apparatus may be a terminal device, or a component (such as a processing device, a circuit, a chip, etc.) in the terminal device.
  • the terminal equipment includes vehicle-mounted equipment (such as a car machine, a vehicle-mounted computer, etc.), and a user terminal (such as a mobile phone, a computer, etc.).
  • the voiceprint management device may be a server, or a component (such as a processing device, circuit, chip, etc.) in the server.
  • the server may include physical devices such as hosts or processors, virtual devices such as virtual machines or containers, and may also include chips or integrated circuits.
• the server is, for example, an Internet of Vehicles (IoV) server, which may also be called a cloud server, the cloud, or a cloud controller.
• the IoV server may be a single server or a server cluster composed of multiple servers, which is not specifically limited.
• In the voiceprint management method, based on the reference voiceprint and the reference similarity, it can be determined whether the voiceprint collected in the current time window has passed verification, and then whether the user's reference voiceprint needs to be updated.
• The voiceprint management method is analyzed below in the order of three processes: obtaining the reference voiceprint, obtaining the reference similarity, and voiceprint verification.
• The following three processes are all directed at the same user (who may also be called the registrant, the speaker, etc.): the user's reference voiceprint and reference similarity are obtained, the user's voiceprint is verified, and it is determined whether the user's reference voiceprint needs to be updated.
  • the same user can be determined based on technologies such as voiceprint comparison and face recognition.
• The similarity threshold can be used to indicate whether two voiceprints correspond to the same user. It can be understood that, regardless of whether the user's voiceprint changes, the similarity between the user's collected voiceprint and the user's reference voiceprint is greater than the similarity threshold, while the similarity between voiceprints of different users is less than the similarity threshold.
  • the reference voiceprint registered by user A is reference voiceprint a
  • the reference voiceprint registered by user B is reference voiceprint b
  • the similarity between reference voiceprint a and reference voiceprint b is less than the similarity threshold.
• If the similarity between a collected voiceprint and reference voiceprint a is greater than the similarity threshold while its similarity with reference voiceprint b is less than the similarity threshold, it can be determined that the collected voiceprint corresponds to user A rather than user B.
• To ensure that the collected voice signal is the user's voice signal, and that the voiceprint extracted from it is the user's voiceprint, it can be determined based on the user's mouth shape that the user is speaking. For example, the user's mouth can open and/or close according to a preset rule; if a voice signal corresponding to the preset rule is collected, it can be determined that the current user is speaking and that the acquired voice signal was sent by the user.
  • the user may also be identified in other ways.
  • a user account can be set, and when the user logs in through the account and sends a voice signal, it can be determined that the obtained voice signal and the user corresponding to the account login are the same user.
  • the account number can also be replaced by the user's fingerprint.
• When the user sends a voice signal after logging in through the fingerprint, it can be determined that the obtained voice signal and the user corresponding to the fingerprint login are the same user.
  • one or more of the above-mentioned voiceprint comparison, face recognition, and account verification can also be combined to determine the same user, so as to improve the accuracy of determining the same user.
  • the same user targeted is referred to as the first user.
  • the first user may register a voiceprint at a registration time point (t0 as shown in FIG. 5(a)), where the voiceprint registered by the first user may be called a registered voiceprint or a reference voiceprint.
  • a preset text may be displayed on the display interface, and the first user reads the preset text to obtain the current voice signal of the first user, and extracts the voiceprint based on the voice signal to obtain the first user's voiceprint.
  • the reference voiceprint of the first user is stored.
• this application can further obtain the reference similarity within a second preset duration after the registration time point.
• The reference similarity and the reference voiceprint can be used together as reference parameters to determine whether the reference voiceprint needs to be updated; for details, refer to the description of the voiceprint verification process below.
  • the period after the registration time point and corresponding to the second preset duration may be referred to as a reference time period, a reference time window, a second time window, and the like.
  • the second time window can perform a sliding operation under specific circumstances, and the actual duration of the second time window is variable.
• The voice signal of the first user in the second time window (t0-t1 as shown in Figure 5(a)) can be acquired, and a voiceprint is extracted from the voice signal; this voiceprint is the voiceprint of the first user collected in the second time window (and may be referred to as the second voiceprint).
  • the second voiceprint can be compared with the reference voiceprint.
  • the similarity between the second voiceprint and the reference voiceprint can be determined according to a similarity algorithm (such as cosine similarity algorithm, Euclidean distance algorithm, etc.).
  • multiple second voiceprints may be collected in the second time window, that is, similarities between the multiple second voiceprints and the reference voiceprint may be determined respectively.
  • an average value among the multiple similarities may be used as a reference similarity corresponding to the second time window (which may be referred to as an average similarity).
  • the median of the multiple similarities, or the average value between the maximum value and the minimum value may also be used as the reference similarity degree corresponding to the second time window, or in other ways.
  • the average similarity corresponding to the second time window is used as an example for illustration.
  • the average similarity can be replaced by the benchmark similarity to represent the same meaning.
• A second number threshold can be preset. When the number of second voiceprints collected in the second time window of the second preset duration is less than the second number threshold, the duration of the second time window may be automatically extended until the number of second voiceprints in the second time window reaches the second number threshold.
• For example, if the second number threshold is 10 but only 8 second voiceprints are collected within t0-t1 shown in Figure 5(a), the duration of the second time window can be automatically extended; for example, if the 10th second voiceprint is not collected until t1', it can be determined that the end point of the second time window is t1', that is, the second time window is t0-t1'.
  • the average similarity in the second time window may be determined based on the similarities between the 10 second voiceprints and the reference voiceprint respectively.
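The window-extension rule described above (which also applies later to the first time window) can be sketched as a pure function over hypothetical voiceprint collection timestamps, measured in days:

```python
def window_end(collection_times, start, planned_end, count_threshold):
    """Return the actual end of a time window: the planned end if at least
    count_threshold voiceprints were collected by then, otherwise the time
    at which the count_threshold-th voiceprint since `start` arrived."""
    in_window = sorted(t for t in collection_times if t >= start)
    if len(in_window) < count_threshold:
        raise ValueError("count threshold never reached")
    nth = in_window[count_threshold - 1]  # time of the threshold-th voiceprint
    return planned_end if nth <= planned_end else nth

# Only 8 voiceprints arrive by the planned end (day 7); the 10th arrives
# on day 9, so the window stretches from t0-t1 to t0-t1'.
times = [0.5, 1, 2, 3, 4, 5, 6, 6.5, 8, 9, 11]
print(window_end(times, start=0, planned_end=7, count_threshold=10))  # 9
```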
• multiple voiceprints of the first user in the verification time period may be acquired, and based on the acquired voiceprints it is determined whether the first user's voiceprint has undergone a long-term change, that is, whether it is necessary to update the first user's reference voiceprint based on the voiceprint after the long-term change.
  • the starting point of the first time window is later than the starting point of the second time window, and the first time window may partially overlap or not overlap with the second time window.
  • the first time window may be set as a first preset duration, and the first preset duration may be equal to or different from the second preset duration. Further, the first time window may perform a sliding operation under specific circumstances, and the actual duration of the first time window is variable.
  • FIG. 6 exemplarily shows a schematic flow diagram of voiceprint verification, in this flow:
  • Step 601 acquire a first speech signal in a first time window.
• The first voice signal of the first user in the first time window (t1-t2 as shown in Figure 5(a)) can be obtained, and according to the voiceprint extraction process exemplarily shown in Figure 1c, the first voiceprint of the first user is extracted from the first voice signal.
  • Step 602 extracting the first voiceprint from the first voice signal, and determining the verification result of the first voiceprint.
• The first voiceprint can be compared with the reference voiceprint to determine the verification result of the first voiceprint. Specifically, the similarity between the first voiceprint and the reference voiceprint can be determined; when the similarity is greater than the average similarity corresponding to the second time window, it is recorded that the first voiceprint has passed the verification, and when the similarity is not greater than the average similarity corresponding to the second time window, it is recorded that the first voiceprint has failed the verification.
• A bit can be used to record whether the corresponding first voiceprint passes the verification; for example, when the first voiceprint passes the verification, the value of the corresponding bit is recorded as 1, and when the first voiceprint fails the verification, the value of the corresponding bit is recorded as 0.
• In this way, the plurality of first voiceprints need not be stored; only the verification results of the plurality of first voiceprints need to be stored, and each verification result can occupy one bit, which helps reduce storage space.
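A minimal sketch of storing each verification result in a single bit, assuming the results are indexed in collection order:

```python
def record_result(bitmap, index, passed_flag):
    """Store the pass/fail result of the index-th first voiceprint in one bit."""
    return bitmap | (1 << index) if passed_flag else bitmap & ~(1 << index)

def passed(bitmap, index):
    """Read back whether the index-th first voiceprint passed verification."""
    return (bitmap >> index) & 1 == 1

bitmap = 0
results = [True, True, True, False, True]  # verification outcomes, e.g. 1,1,1,0,1
for i, r in enumerate(results):
    bitmap = record_result(bitmap, i, r)
print(bin(bitmap))  # 0b10111
```

Five results fit in five bits instead of five stored voiceprint vectors, which is the storage saving the text describes.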
  • Step 603 Determine whether to update the reference voiceprint according to the verification result of the first voiceprint in the first time window. If no update is required, the voiceprint verification process can be performed again. If an update is required, the voiceprint update process can be performed.
  • Multiple first voiceprints may be collected in the first time window, that is, verification results of multiple first voiceprints may be determined. Whether to update the reference voiceprint of the first user may be determined based on the verification results of multiple first voiceprints.
• the proportion of first voiceprints that passed verification (which may be referred to as the compliance rate) may be counted, that is, the ratio of the number of first voiceprints that passed verification to the total number of first voiceprints.
• If the ratio is greater than the ratio threshold, the current voiceprint of the first user has not undergone a long-term change, and there is no need to update the first user's reference voiceprint. If the ratio is less than or equal to the ratio threshold, the current voiceprint of the first user has undergone a long-term change, and the first user's reference voiceprint needs to be updated.
  • the ratio threshold is 70%.
• For example, a total of 5 first voiceprints are obtained in the first time window, and the similarities between the 5 first voiceprints and the reference voiceprint are determined to be, for example, 0.90, 0.90, 0.90, 0.80, and 0.86. The verification results of the five first voiceprints are then 1, 1, 1, 0, and 1 respectively, so the proportion of first voiceprints that passed verification in the first time window is 80% (greater than the 70% ratio threshold), and there is no need to update the first user's reference voiceprint.
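The decision rule can be sketched as follows; the verification results 1, 1, 1, 0, 1 reproduce the example above (which implies, without stating, an average similarity from the second time window somewhere between 0.80 and 0.86):

```python
def should_update(verification_results, ratio_threshold=0.70):
    """Decide whether the reference voiceprint needs updating: update when
    the pass rate within the window is at or below the ratio threshold."""
    pass_rate = sum(verification_results) / len(verification_results)
    return pass_rate <= ratio_threshold, pass_rate

# Similarities 0.90, 0.90, 0.90, 0.80, 0.86 yield verification results
# 1, 1, 1, 0, 1, i.e. a pass rate of 80% -- above the 70% threshold,
# so no update is needed.
update, rate = should_update([1, 1, 1, 0, 1])
print(update, rate)  # False 0.8
```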
  • the first number threshold can be preset, and when it is determined that the number of verification results of the first voiceprint in the first time window of the first preset duration is less than the first number threshold , the duration of the first time window may be automatically extended until it is determined that the number of verification results of the first voiceprint in the first time window reaches the first number threshold.
• For example, if the first number threshold is 10 but only 8 first voiceprints are collected within t1-t2 shown in Figure 5(a), the duration of the first time window can be automatically extended; for example, if the 10th first voiceprint is not collected until t2', it can be determined that the end point of the first time window is t2', that is, the first time window is t1-t2'. Further, whether to update the reference voiceprint may be determined based on the verification results of the 10 first voiceprints.
• The average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint.
  • the average similarity in the first time window may be used for subsequent updating of the reference voiceprint, for details, please refer to the following embodiments.
  • Step 604 slide the first time window.
  • the first time window may be slid backward for a third preset duration, and the third preset duration may be shorter than the first preset duration, or may be shorter than the second preset duration.
  • the end point of the first time window after sliding is later than the end point of the first time window before sliding.
  • the first time window slides from t1-t2 before sliding to t3-t4, where the interval between t2 and t4 is the third preset duration.
  • the first preset duration may be 7 days
  • the second preset duration may be 7 days
  • the third preset duration may be 1 day.
• The duration of the first time window may be extended until the number of verification results of the first voiceprint in the first time window reaches the first number threshold; that is, the duration of the first time window is variable, and there may be a situation where the duration of the first time window before sliding is greater than the first preset duration.
• In this case, the end point of the first time window can be slid backward by the third preset duration, and then the starting point of the first time window after sliding can be determined from the end point of the slid window and the first preset duration.
• For example, the end point of the first time window is extended to t2' (where the interval between t1 and t2' is greater than the first preset duration).
  • the time point corresponding to t2'+the third preset duration can be used as the end point of the first time window after sliding, for example, it is expressed as t4'.
  • it can be determined that the starting point of the first time window after sliding is t4'-the first preset duration, for example, it is expressed as t3'.
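The endpoint arithmetic for sliding an extended window can be sketched as follows (times in days; the 7-day first preset duration and 1-day third preset duration follow the earlier example):

```python
def slide_window(prev_end, first_preset, third_preset):
    """After a window ending at prev_end (possibly extended beyond its
    preset length), the slid window ends third_preset later (t4' = t2' +
    third preset duration) and starts first_preset before that end
    (t3' = t4' - first preset duration)."""
    new_end = prev_end + third_preset
    new_start = new_end - first_preset
    return new_start, new_end

# The extended window ended on day 9 (t2'), so the slid window is t3'-t4'.
print(slide_window(9, 7, 1))  # (3, 10)
```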
  • the first voiceprint of the first user may be continuously collected in the first time window after sliding, the similarity between the collected first voiceprint and the reference voiceprint is determined, and then the verification result of the first voiceprint is determined.
  • the verification results of multiple first voiceprints may be determined in the first time window after sliding, and it is determined whether the reference voiceprint of the first user needs to be updated according to the verification results of the multiple first voiceprints. That is, after step 604, the above steps 601 to 603 may be continued until the reference voiceprint of the first user is updated.
• The first time window may gradually slide from t1-t2 to t5-t6. Further, when the first time window slides to t5-t6, if it is determined according to the verification results of the multiple first voiceprints obtained in t5-t6 that the voiceprint update process needs to be executed, steps 605 and 606 (the voiceprint update process) can be started at t6.
  • Step 605 acquire the second voice signal, wherein the difference between the similarity between the second voice signal and the reference voiceprint and the average similarity in the first time window is less than the difference threshold.
  • the voice signal of the first user can be collected to determine whether the voice signal meets the preset condition, and if so, the voice signal is used to update the reference voiceprint, otherwise, the voice signal is discarded.
  • the voiceprint is determined according to the collected voice signal, and the similarity between the voiceprint and the reference voiceprint is determined. If the difference between the similarity and the average similarity in the first time window is smaller than the difference threshold, the collected voice signal (ie, the second voice signal) may be used to update the reference voiceprint. If the difference between the similarity and the average similarity in the first time window is not less than the difference threshold, the collected speech signal may be discarded.
  • the difference threshold is 0.1.
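The filtering rule of step 605 can be sketched as a single predicate; the 0.72 window average similarity used in the usage lines is a hypothetical value.

```python
def keep_for_update(similarity, window_avg_similarity, diff_threshold=0.1):
    """Keep a collected voice signal for updating the reference voiceprint
    only when its similarity to the reference voiceprint is within
    diff_threshold of the first time window's average similarity."""
    return abs(similarity - window_avg_similarity) < diff_threshold

print(keep_for_update(0.78, 0.72))  # True: difference 0.06 < 0.1, signal kept
print(keep_for_update(0.95, 0.72))  # False: difference 0.23, signal discarded
```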
  • Step 606 Update the reference voiceprint according to the second voice signal.
  • one or more second voice signals may be obtained after the voiceprint update process is started.
• a deduplication operation, or a deduplication operation and a splicing operation, may be performed on the one or more second voice signals to obtain the voice signal after the deduplication operation (or after the deduplication and splicing operations), and then the voiceprint is extracted therefrom.
• ASR may be performed on the second speech signal to obtain the text corresponding to the second speech signal.
• A deduplication operation is performed on the text corresponding to the second speech signal to obtain text without repeated content, which may be referred to as deduplicated text.
• Correspondingly, a voice signal after the deduplication operation (which may be called a deduplicated voice signal) is obtained.
  • An update condition can be preset; when the deduplicated voice signal meets the update condition, the reference voiceprint can be updated according to the deduplicated voice signal.
  • For example, the update condition may be that the duration of the deduplicated speech signal is greater than the duration threshold, and/or that the number of characters in the non-repetitive text (i.e., the deduplicated text) corresponding to the deduplicated voice signal is greater than the word-count threshold.
  • In the following description, the update condition is that the duration of the deduplicated speech signal is greater than the duration threshold and that the word count of the deduplicated text corresponding to the deduplicated speech signal is greater than the word-count threshold.
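Assuming the combined condition just described (both thresholds must be exceeded), the check can be sketched as a small predicate; the 5 s / 12-word default values follow the worked example given later in the text:

```python
def meets_update_condition(duration_s: float, word_count: int,
                           duration_threshold_s: float = 5.0,
                           word_count_threshold: int = 12) -> bool:
    """Update condition on the deduplicated voice signal: its duration must
    exceed the duration threshold AND the word count of its non-repetitive
    text must exceed the word-count threshold."""
    return duration_s > duration_threshold_s and word_count > word_count_threshold

print(meets_update_condition(5.5, 16))  # both thresholds exceeded
print(meets_update_condition(1.5, 4))   # neither exceeded
```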
  • For example, the process of updating the reference voiceprint is started at t6, and one or more second voice signals are then collected until, at t7, a deduplicated voice signal that meets the update condition is obtained; the reference voiceprint is updated at t7 according to this deduplicated voice signal.
  • t6-t7 may be understood as a third time window, and the third time window is used to acquire one or more second voice signals.
  • Steps 605 and 606 are explained below with reference to the schematic flowchart of updating a reference voiceprint provided in FIG. 8.
  • Step 801: acquire the 1st second voice signal.
  • Step 802: determine, according to the 1st second voice signal, the deduplicated voice signal corresponding to the 1st second voice signal (the corresponding deduplicated text may be referred to as the first deduplicated text).
  • For example, the first user wants to wake up a certain device, and the wake-up word of the device is "Little A".
  • The first user can utter the second voice signal "Little A, Little A".
  • ASR processing is performed on the second speech signal to obtain the text corresponding to the second speech signal, that is, "Little A, Little A".
  • the deduplication operation may be performed, and the obtained first deduplicated text is "Little A”.
  • For another example, the first user sends a second voice signal "Little A, Little A, please turn on the air conditioner".
  • The first deduplicated text obtained from this second voice signal is "Little A, please turn on the air conditioner".
  • For yet another example, the first user sends the second voice signal "Hello, Little A", or "Little A, please turn on the air conditioner", or "Little A, please turn on the music", etc.
  • the ASR is performed on the second speech signal, it is determined that there is no repeated text in the text corresponding to the second speech signal, and the text corresponding to the second speech signal is directly used as the first deduplication text.
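A minimal sketch of the text deduplication operation, assuming the ASR result has already been segmented into phrases (a real system would choose its own deduplication granularity):

```python
def dedup_phrases(phrases: list[str]) -> list[str]:
    """Order-preserving removal of repeated phrases, e.g. a doubled wake-up
    word. Returns the deduplicated text as a phrase list."""
    seen: set[str] = set()
    result: list[str] = []
    for p in phrases:
        if p not in seen:
            seen.add(p)
            result.append(p)
    return result

print(dedup_phrases(["Little A", "Little A", "please turn on the air conditioner"]))
# -> ['Little A', 'please turn on the air conditioner']
```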
  • The speech signal corresponding to the first deduplicated text within the 1st second speech signal can be determined, so as to obtain the speech signal left after the deduplication operation is performed on the 1st second speech signal (that is, the first deduplicated speech signal).
  • For example, the voice signal segments in the 1st second voice signal that correspond to the first deduplicated text "Little A, please turn on the air conditioner" can be determined.
  • In combination with Table 1, the speech signal segment 11 corresponding to "Little A (the first)" and the speech signal segment 13 corresponding to "please turn on the air conditioner" are concatenated to obtain the first deduplicated speech signal.
  • Since the speech signal segment in the 1st second speech signal corresponding to "Little A" in the first deduplicated text can be either speech signal segment 11 or speech signal segment 12, when splicing, either segment 11 and segment 13 can be concatenated to obtain the first deduplicated speech signal, or segment 12 and segment 13 can be concatenated to obtain it.
  • The above is only an example of how to obtain the deduplicated voice signal from the second voice signal according to the deduplicated text.
  • the second voice signal may also be as shown in Table 2.
  • the second voice signal may also be divided into other manners, which are not limited in this application.
  • Table 2:
      Text              | Speech signal segment
      Little A (first)  | Speech signal segment 11
      Little A (second) | Speech signal segment 12
      please turn on    | Speech signal segment 14
      air conditioner   | Speech signal segment 15
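Using the Table 1 segmentation as assumed input, the splicing step (pick one matching speech-signal segment per deduplicated phrase; the first match is chosen here, though the text notes either duplicate segment would do) could look like:

```python
def splice_segments(dedup_text: list[str],
                    segments: list[tuple[str, str]]) -> list[str]:
    """segments is a list of (segment_id, phrase) pairs.
    For each phrase of the deduplicated text, the first not-yet-used
    matching segment is chosen; the chosen ids are kept in order, modelling
    the concatenation into the deduplicated speech signal."""
    chosen: list[str] = []
    for phrase in dedup_text:
        for seg_id, seg_phrase in segments:
            if seg_phrase == phrase and seg_id not in chosen:
                chosen.append(seg_id)
                break
    return chosen

table1 = [("segment 11", "Little A"), ("segment 12", "Little A"),
          ("segment 13", "please turn on the air conditioner")]
print(splice_segments(["Little A", "please turn on the air conditioner"], table1))
# -> ['segment 11', 'segment 13']
```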
  • Step 803: if it is determined that the first deduplicated speech signal meets the update condition, step 807 is executed; if it is determined that the first deduplicated speech signal does not meet the update condition, step 804 is performed.
  • Step 804: acquire the 2nd second voice signal.
  • Step 805: determine, according to the 2nd second voice signal and the first deduplicated text, the deduplicated voice signal corresponding to the 1st and 2nd second voice signals.
  • The first deduplicated text may be used as the historical deduplicated text for the 2nd second speech signal.
  • ASR processing is performed on the 2nd second speech signal to obtain its corresponding text, and a deduplication operation is performed on that text according to the historical deduplicated text (the first deduplicated text), to obtain the deduplicated text corresponding to the 1st and 2nd second voice signals (which may be referred to as the second deduplicated text).
  • For example, the first deduplicated text is "Little A, please turn on the air conditioner".
  • The 2nd second voice signal is "Little A, Little A, turn the air conditioner higher". ASR is executed on the 2nd second voice signal to obtain the text "Little A, Little A, turn the air conditioner higher"; performing the deduplication operation against the historical deduplicated text leaves the new text "turn the air conditioner higher".
  • The second deduplicated text is therefore "Little A, please turn on the air conditioner higher".
  • Alternatively, the text corresponding to the 2nd second speech signal may first be spliced onto the first deduplicated text, and the deduplication operation is then performed on the spliced text to obtain the second deduplicated text.
  • For example, the first deduplicated text is "Little A, please turn on the air conditioner".
  • The 2nd second voice signal is "Little A, Little A, turn the air conditioner higher".
  • ASR is performed on the 2nd second voice signal to obtain the text "Little A, Little A, turn the air conditioner higher".
  • The speech signal segments corresponding to the second deduplicated text can be determined in the 1st second speech signal and in the 2nd second speech signal, so as to obtain the voice signal after the 1st and 2nd second voice signals are deduplicated and spliced (that is, the second deduplicated voice signal).
  • For example, the 1st second voice signal above is "Little A, Little A, please turn on the air conditioner".
  • The 2nd second voice signal is "Little A, Little A, turn the air conditioner higher".
  • Table 1 shows the correspondence between "Little A, Little A, please turn on the air conditioner" and the speech signal segments in the 1st second speech signal.
  • Table 3 shows the correspondence between "Little A, Little A, turn the air conditioner higher" and the speech signal segments in the 2nd second speech signal.
  • Specifically, the corresponding voice signal segments in the 1st second voice signal and in the 2nd second voice signal can be selected according to the second deduplicated text to determine the second deduplicated speech signal.
  • For example, the speech signal segment 13 corresponding to "please turn on the air conditioner" and the speech signal segment 24 corresponding to "higher" are spliced to obtain the second deduplicated voice signal.
  • Alternatively, the second deduplicated voice signal may be formed by concatenating speech signal segment 12, speech signal segment 13, and speech signal segment 24.
  • After the second deduplicated speech signal is acquired, it may be determined whether it meets the update condition (that is, the judgment of step 803 is performed again); if so, step 807 is performed.
  • If not, the 3rd second voice signal is acquired, a deduplication operation is performed on the text corresponding to the 3rd second voice signal according to the historical deduplicated text (the second deduplicated text), and the deduplicated text corresponding to the 1st to 3rd second voice signals (which may be referred to as the third deduplicated text) is obtained.
  • A deduplicated speech signal (which may be referred to as the third deduplicated speech signal) is then determined from the 1st to 3rd second speech signals, and it is determined whether the third deduplicated voice signal meets the update condition.
  • More generally, step 806 can be performed to obtain the i-th second voice signal, and to determine, according to the i-th second voice signal and the historical deduplicated text, the deduplicated voice signal corresponding to the 1st to i-th second voice signals.
  • For example, in combination with the table below, the first deduplicated text is "Hello, Little A"; its word count is 4, and the duration of the first deduplicated voice signal is 1.5s.
  • Since the word count of the first deduplicated text is not more than 12, and the duration of the first deduplicated voice signal is not more than 5s, the 2nd second voice signal is further obtained.
  • This continues until the sixth deduplicated speech signal meets the update condition, and step 807 is performed according to the sixth deduplicated speech signal.
  • No. | Second voice signal             | Deduplicated text                                                             | Word count | Duration
    1   | Hello Little A                  | Hello Little A                                                                | 4          | 1.5s
    2   | turn on the air conditioner     | Hello Little A, turn on the air conditioner                                   | 8          | 3s
    3   | turn the air conditioner higher | Hello Little A, turn on the air conditioner higher                            | 10         | 3.5s
    4   | Hello Little A                  | Hello Little A, turn on the air conditioner higher                            | 10         | 3.5s
    5   | open the sunroof                | Hello Little A, turn on the air conditioner higher, open the sunroof          | 12         | 4s
    6   | play music                      | Hello Little A, turn on the air conditioner higher, open the sunroof, play music | 16      | 5.5s
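The iteration just tabulated can be replayed as a loop. The per-phrase word counts and durations below are illustrative values chosen so the running totals match the table, with the 12-word / 5 s thresholds of the running example:

```python
def accumulate_until_update(incoming, duration_threshold_s=5.0,
                            word_count_threshold=12):
    """Each incoming (phrase, word_count, duration) triple is merged into the
    history unless the phrase is already present (the deduplication step);
    the update condition is re-checked after every merge. Returns the totals
    at the moment the condition is first met, or None if it never is."""
    seen, words, duration = set(), 0, 0.0
    for phrase, wc, dur in incoming:
        if phrase not in seen:
            seen.add(phrase)
            words += wc
            duration += dur
        if words > word_count_threshold and duration > duration_threshold_s:
            return words, duration
    return None

rows = [("Hello Little A", 4, 1.5), ("turn on the air conditioner", 4, 1.5),
        ("turn the air conditioner higher", 2, 0.5), ("Hello Little A", 4, 1.5),
        ("open the sunroof", 2, 0.5), ("play music", 4, 1.5)]
print(accumulate_until_update(rows))  # condition first met at the sixth signal: (16, 5.5)
```

Note how the repeated "Hello Little A" at row 4 adds nothing, exactly as in the table.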
  • Step 807: determine the third voiceprint of the first user according to the deduplicated voice signal that meets the update condition.
  • For example, the voiceprint extraction process exemplarily shown in FIG. 1c is executed on the deduplicated voice signal that meets the update condition, to obtain the third voiceprint of the first user.
  • Step 808: update the reference voiceprint of the first user according to the third voiceprint.
  • the reference voiceprint of the first user may be actively updated, that is, the original reference voiceprint is replaced by the third voiceprint.
  • the first user may be prompted whether to update the reference voiceprint.
  • For example, the first user is user A (the car owner). When the vehicle-mounted device determines that the voiceprint of user A has changed permanently, it can prompt user A in the display interface whether to automatically update the reference voiceprint.
  • For example, the in-vehicle device displays a prompt message "It is detected that your voiceprint has changed. Do you want to update the detected voiceprint to replace the original voiceprint?" on the display interface. If user A clicks "OK", the in-vehicle device can replace the original reference voiceprint with the obtained third voiceprint. If user A clicks "No, I want to update by myself", the vehicle-mounted device can further display a preset text on the display interface, prompt user A to read the preset text aloud, obtain the current voice signal of user A, extract a voiceprint from that voice signal to obtain the new reference voiceprint of user A, and store it.
  • the user A may be prompted on the display interface to update the reference voiceprint by himself.
  • the second time window can be slid to obtain a slid second time window.
  • the starting point of the second time window after sliding is not earlier than the time point of updating the reference voiceprint; or, the starting point of the second time window after sliding is not earlier than the time point of acquiring the second voice signal.
  • the starting point of the second time window may be after the ending point of the third time window, or coincide with the ending point of the third time window. As shown in (e) of FIG. 5 , the starting point of the second time window after sliding coincides with the ending point of the third time window.
  • After the reference voiceprint is updated, one or more second voiceprints may further be obtained in the slid second time window, and the average similarity in the slid second time window is determined according to the similarities between the one or more second voiceprints and the updated reference voiceprint.
  • the first time window is slid, and the end point of the slid first time window is later than the start point of the slid second time window.
  • the starting point of the first time window after sliding may be after the ending point of the second time window after sliding, or coincide with the ending point of the second time window after sliding.
  • the starting point of the first time window after sliding coincides with the ending point of the second time window after sliding.
  • the first time window after sliding is specifically t8-t9, wherein the interval between t8-t9 is a first preset time length.
  • One or more first voiceprints are obtained in the slid first time window; the verification results of the one or more first voiceprints are determined according to the similarities between those first voiceprints and the updated reference voiceprint and the average similarity of the slid second time window, and it is then determined whether to update the reference voiceprint again.
  • Specifically, the voice signal of the first user is acquired and pre-processed to obtain audio feature information of the voice signal. Audio feature extraction is performed on this information to determine the voiceprint of the first user, and post-processing is performed to obtain the final voiceprint of the first user. In one case, the voiceprint of the first user is registered or updated as the reference voiceprint. In another case, it may be determined that the current time point falls within the first time window, or within the second time window.
  • If the current time point falls within the first time window, the voiceprint of the first user (that is, the first voiceprint) is compared with the reference voiceprint to obtain the similarity between the first voiceprint and the reference voiceprint; if this similarity is greater than the average similarity in the second time window, it is determined that the first voiceprint has passed verification.
  • The ratio of verified first voiceprints in the first time window (i.e., the compliance rate in the first time window) is then determined.
  • If the compliance rate is greater than the ratio threshold, the first time window is slid; if the compliance rate is less than or equal to the ratio threshold, the process of updating the reference voiceprint is started.
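The decision logic of this step, with the greater-than / less-than-or-equal split as described (the concrete ratio threshold value is an assumption for illustration):

```python
def should_start_update(verifications: list[bool],
                        ratio_threshold: float = 0.5) -> bool:
    """Compliance rate = fraction of first voiceprints in the first time
    window that passed verification. At or below the ratio threshold the
    reference-voiceprint update process starts; above it, the first time
    window simply slides onward."""
    compliance_rate = sum(verifications) / len(verifications)
    return compliance_rate <= ratio_threshold

print(should_start_update([True, False, False, False]))  # low compliance -> start update
print(should_start_update([True, True, True, False]))    # high compliance -> slide window
```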
  • If the current time point falls within the second time window, the voiceprint of the first user (that is, the second voiceprint) is compared with the reference voiceprint to obtain the similarity between the second voiceprint and the reference voiceprint.
  • The average of the similarities between the multiple second voiceprints and the reference voiceprint is determined as the average similarity.
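The average similarity of the second time window is a plain mean over per-voiceprint similarities. The cosine scorer here is an illustrative stand-in for whatever similarity measure the system actually uses:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def window_average_similarity(second_voiceprints, reference, score=cosine):
    """Mean similarity between each second voiceprint obtained in the
    second time window and the reference voiceprint."""
    sims = [score(v, reference) for v in second_voiceprints]
    return sum(sims) / len(sims)

ref = (1.0, 0.0)
prints = [(1.0, 0.0), (0.0, 1.0)]  # one identical, one orthogonal
print(window_average_similarity(prints, ref))  # (1.0 + 0.0) / 2 = 0.5
```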
  • Various thresholds are involved in this application, such as the ratio threshold, similarity threshold, difference threshold, times threshold, duration threshold, and word-count threshold.
  • For any of these thresholds, being greater than or equal to the threshold may correspond to the first result and being less than the threshold to the second result; alternatively, being greater than the threshold may correspond to the first result and being less than or equal to the threshold to the second result. This is not limited in this application.
  • Taking the ratio threshold as an example: when the ratio is greater than the ratio threshold, there is no need to update the first user's reference voiceprint (i.e., the first result); when the ratio is less than or equal to the ratio threshold, the first user's reference voiceprint is updated (i.e., the second result). However, it is equally possible in this application that when the ratio is greater than or equal to the ratio threshold there is no need to update the reference voiceprint (the first result), and when the ratio is less than the ratio threshold the reference voiceprint is updated (the second result).
  • The semantic understanding device may execute ASR to obtain the text corresponding to the second voice signal, and the voiceprint management device then obtains the text corresponding to the second voice signal from the semantic understanding device.
  • The manner of obtaining the text corresponding to the second voice signal is not limited in the present application.
  • In the above technical solution, the reference voiceprint of the first user is determined in advance; second voiceprints of the first user are then obtained in the second time window, the similarity between each second voiceprint and the reference voiceprint is determined, and the average similarity in the second time window is determined from these multiple similarities.
  • Since the second time window lies after the time point when the reference voiceprint is stored, the user's voiceprint is only changing steadily during the period from registration of the reference voiceprint to the end point of the second time window. The steadily-changing voiceprint of the first user can therefore be used to obtain a reference parameter (that is, the average similarity of the second time window with respect to the reference voiceprint).
  • Further, the similarity between the first voiceprint and the reference voiceprint can be determined based on the user's first voiceprint in the first time window, and compared with the average similarity of the second time window to determine the verification result of the first voiceprint; that is, to determine whether the first voiceprints in the first time window resemble the second voiceprints in the second time window, and hence whether the user's voiceprint has undergone a long-term change during the period from the start point of the second time window to the end point of the first time window.
  • the verification results of multiple first voiceprints in the first time window are determined, and the ratio of the first voiceprints that pass verification in the first time window (that is, the compliance rate in the first time window) is determined.
  • Whether the user's voiceprint has undergone a long-term change is determined according to the compliance rate in the first time window, which avoids misjudgments caused by incidental factors affecting the user within the first time window and helps improve the accuracy of the judgment.
  • Furthermore, the first time window can be slid, and the ratio of verified first voiceprints in the slid first time window (that is, the compliance rate in the slid first time window) can be determined. Whether the user's voiceprint has undergone a permanent change is determined according to that compliance rate, which helps to discover permanent changes in the first user's voiceprint in a timely manner and to update the first user's registered voiceprint.
  • The second voice signal is acquired, and the voiceprint of the first user is determined from it, without interacting with the user; the process is thus imperceptible to the user, which helps to improve the user experience.
  • There is no need to pre-store multiple first voiceprints in the first time window, which helps to reduce the amount of storage.
  • Likewise, there is no need to store multiple second voiceprints in the second time window; only the average similarity of the second time window needs to be saved, further reducing the storage requirements.
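The storage saving can be made concrete with a running-mean update: only the current average and a count survive between voiceprints, never the voiceprints themselves. This is a sketch of the storage idea, not the patented procedure:

```python
def update_running_average(avg: float, count: int, new_similarity: float):
    """Incorporate one new similarity into the stored average; only the
    pair (avg, count) needs to be kept in storage."""
    count += 1
    avg += (new_similarity - avg) / count
    return avg, count

avg, n = 0.0, 0
for s in (0.5, 0.7, 0.6):
    avg, n = update_running_average(avg, n, s)
print(round(avg, 6), n)  # mean of the three similarities, and their count
```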
  • the methods and operations implemented by the voiceprint management device may also be implemented by components (such as chips or circuits) that can be used in the voiceprint management device.
  • each functional module in each embodiment of the present application may be integrated into one processor, or physically exist separately, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • Fig. 11 and Fig. 12 are schematic structural diagrams of possible voiceprint management devices exemplarily provided by the present application. These voiceprint management devices can be used to realize the functions of the above-mentioned method embodiments, and therefore can also realize the beneficial effects possessed by the above-mentioned method embodiments.
  • the voiceprint management device includes: an acquisition module 1101 and a processing module 1102 .
  • The acquisition module 1101 can be used to perform the acquisition functions of the voiceprint management device in the relevant method embodiments of FIG. 6 or FIG. 8, for example, the steps of acquiring voice signals.
  • The processing module 1102 can be used to execute the processing functions of the voiceprint management device in the relevant method embodiments of FIG. 6 or FIG. 8, for example, the voiceprint judgment steps or the steps of updating the reference voiceprint.
  • The processing module 1102 is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  • The verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window; specifically, when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that it passes verification.
  • the processing module 1102 is specifically configured to: update the reference voiceprint according to the ratio of the verified first voiceprint in the first time window.
  • the processing module 1102 is specifically configured to: update the reference voiceprint when the ratio of the verified first voiceprint in the first time window is less than or equal to a ratio threshold.
  • Optionally, the processing module 1102 is specifically configured to: control the acquisition module 1101 to acquire the second voice signal of the first user, wherein the difference between the similarity between the second voice signal and the reference voiceprint and the average similarity of the first time window is less than the difference threshold, the average similarity of the first time window being used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and update the reference voiceprint according to the second voice signal.
  • Optionally, the processing module 1102 is specifically configured to: obtain the deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, wherein the text corresponding to the deduplicated voice signal is non-repetitive text; and update the reference voiceprint according to the deduplicated voice signal.
  • Optionally, the processing module 1102 is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain the deduplicated text; and obtain, according to the deduplicated text, the corresponding speech signal within the second voice signal, so as to obtain the deduplicated speech signal.
  • Optionally, the processing module 1102 is specifically configured to: for the i-th second voice signal among the multiple second voice signals, perform a deduplication operation on the text corresponding to the i-th second voice signal; splice the text of the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain the deduplicated text corresponding to the 1st to i-th second voice signals, wherein the historical deduplicated text is obtained according to the texts corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtain, according to the resulting deduplicated text, the corresponding voice signals among the plurality of second voice signals, so as to obtain the deduplicated voice signal.
  • Optionally, the duration of the deduplicated speech signal is greater than the duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated speech signal is greater than the word-count threshold.
  • the processing module 1102 is further configured to: slide the first time window, wherein an end point of the first time window after sliding is later than an end point of the first time window before sliding.
  • the length of the first time window is variable.
  • Optionally, the processing module 1102 is further configured to: slide the second time window, wherein the end point of the slid second time window is later than the end point of the second time window before sliding, and the start point of the slid second time window is earlier than the start point of the slid first time window.
  • Optionally, the processing module 1102 is further configured to: determine the similarities of the second voiceprints in the second time window; and determine the average similarity in the second time window according to the multiple similarities in the second time window.
  • FIG. 12 shows another voiceprint management device provided by an embodiment of the present application. The device shown in FIG. 12 may be a hardware-circuit implementation of the device shown in FIG. 11.
  • the apparatus may be applicable to the flow chart shown above to execute the above-mentioned method embodiment.
  • For ease of illustration, only the main components of the device are shown in FIG. 12.
  • the voiceprint management device includes: a processor 1210 and an interface 1230 , and optionally, the voiceprint management device further includes a memory 1220 .
  • the interface 1230 is used to implement communication with other devices.
  • the methods performed by the voiceprint management device in the above embodiments can be implemented by calling the program stored in the memory (which may be the memory 1220 in the voiceprint management device or an external memory) by the processor 1210. That is, the voiceprint management device may include a processor 1210, and the processor 1210 executes the method performed by the voiceprint management device in the above method embodiments by calling a program in the memory.
  • the processor here may be an integrated circuit with signal processing capabilities, such as a CPU.
  • the voiceprint management device can be realized by one or more integrated circuits configured to implement the above method. For example: one or more ASICs, or one or more microprocessors DSP, or one or more FPGAs, etc., or a combination of at least two of these integrated circuit forms. Alternatively, the above implementation manners may be combined.
  • the functions/implementation process of the processing module 1102 and the obtaining module 1101 in FIG. 11 can be realized by calling the computer-executed instructions stored in the memory 1220 by the processor 1210 in the voiceprint management device shown in FIG. 12 .
  • the present application provides a computer program product, the computer program product includes computer programs or instructions, and when the computer program or instructions are executed by a computing device, the methods in the above method embodiments are implemented.
  • the present application provides a computer-readable storage medium, in which computer programs or instructions are stored, and when the computer programs or instructions are executed by a computing device, the methods in the above-mentioned method embodiments are implemented .
  • the present application provides a computing device, including a processor, the processor is connected to a memory, the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the computing device performs the above method Methods in the Examples.
  • an embodiment of the present application provides a chip system, including: a processor and a memory, the processor is coupled to the memory, and the memory is used to store programs or instructions, and when the programs or instructions are executed by the processor, the The chip system implements the methods in the foregoing method embodiments.
  • the chip system further includes an interface circuit for exchanging code instructions to the processor.
  • There may be one or more processors in the chip system, and the processors may be implemented by hardware or by software.
  • the processor may be a logic circuit, an integrated circuit, or the like.
  • the processor may be a general-purpose processor implemented by reading software codes stored in a memory.
  • the memory can be integrated with the processor, or can be set separately from the processor.
  • The memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated with the processor on the same chip, or may be provided on different chips.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that realize the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.


Abstract

A voiceprint management method and apparatus, which are used to timely and accurately update a reference voiceprint in a voiceprint management apparatus, thereby improving recognition accuracy in voiceprint recognition. The method comprises: acquiring a verification result of a first voiceprint in a first time window, wherein the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint and the average similarity of a second time window, the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window, the average similarity of the second time window is used to indicate the similarity between a voiceprint of the first user acquired in the second time window and the reference voiceprint, and the starting point of the first time window is later than the starting point of the second time window; and updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.

Description

A voiceprint management method and apparatus

Technical Field

The present application relates to the technical field of speech recognition, and in particular to a voiceprint management method and apparatus.

Background Art

With the iteration of technology and the improvement of legal regulations, voiceprint recognition technology has come into wide use in many scenarios, such as Internet-of-Vehicles scenarios, smart home scenarios, and business handling scenarios. Voiceprint recognition compares a voiceprint already stored in a voiceprint management device with a newly collected voiceprint to determine the identity of a user.

A reference voiceprint of the user (that is, the voiceprint the user registered with the voiceprint management device) is pre-stored in the voiceprint management device. During voiceprint recognition, when the voiceprint management device acquires a collected voiceprint of the user, it can compare the collected voiceprint with the reference voiceprint and thereby determine whether the two correspond to the same user.

However, a user's voiceprint may change over time. To ensure the accuracy of voiceprint recognition, the reference voiceprint in the voiceprint management device needs to be updated in a timely manner.
Summary of the Invention

The present application provides a voiceprint management method and apparatus, used to update the reference voiceprint in a voiceprint management device in a timely and accurate manner, thereby improving recognition accuracy in voiceprint recognition.

In a first aspect, the present application provides a voiceprint management method. The method may be executed by a terminal device, for example an in-vehicle device (such as a head unit or in-vehicle computer) or a user terminal (such as a mobile phone or computer). The voiceprint management method may also be implemented by a component of the terminal device, such as a processing apparatus, circuit, or chip in the terminal device, for example a system chip or processing chip in the terminal device. The system chip is also called a system on chip (SoC) chip.

The method may also be executed by a server. The server may include a physical device such as a host or a processor, may include a virtual device such as a virtual machine or a container, and may also include a chip or an integrated circuit. The server is, for example, an Internet-of-Vehicles (IoV) server, also called a cloud server, cloud, cloud side, cloud-side server, or cloud-side controller. The IoV server may be a single server or a server cluster composed of multiple servers; this is not specifically limited. In addition, the method may also be executed by a component of the server, for example implemented by a processing apparatus, circuit, chip, or other component in the server.
In a possible implementation, the voiceprint management method provided by the present application includes: acquiring a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and the average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate the similarity between the voiceprint of the first user acquired in the second time window and the reference voiceprint; and the starting point of the first time window is later than the starting point of the second time window; and updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.
It should be understood that, in the above technical solution, the average similarity of the second time window is determined in advance, and this average similarity together with the reference voiceprint serves as the baseline for deciding whether to update the reference voiceprint. This avoids inaccurate decisions caused by a single inaccurate voiceprint registered by the first user. Further, the similarities between the multiple first voiceprints in the first time window and the reference voiceprint are determined and, combined with the average similarity of the second time window, used to determine whether the first user's voiceprint has changed permanently. If the first user's voiceprint has changed permanently, the first user's reference voiceprint is updated; if it has not changed permanently (for example, the change is only short-term), the reference voiceprint is not updated. This helps improve the accuracy of the first user's reference voiceprint, thereby improving recognition accuracy and robustness in voiceprint recognition.
In a possible implementation, the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window includes: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that it passes verification; and/or, when the similarity between the first voiceprint and the reference voiceprint is less than or equal to the average similarity of the second time window, the verification result of the first voiceprint is that it fails verification.

It should be understood that, in the above technical solution, using the reference voiceprint together with the average similarity of the second time window as the baseline for judging whether the first voiceprint passes verification avoids the problem of an inaccurate registered voiceprint caused by chance at registration time, and thus helps improve the accuracy of the verification result of the first voiceprint. Further, only the verification results of the first voiceprints in the first time window (passed or failed) need to be stored, rather than the multiple first voiceprints themselves, which helps reduce the amount of storage.
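The pass/fail rule above can be sketched as follows. Cosine similarity over voiceprint embeddings is an illustrative assumption, not part of the application, which fixes no particular similarity measure; all names are hypothetical:

```python
import math

def cosine_similarity(a, b):
    # Similarity between two voiceprint embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify_first_voiceprint(first_vp, reference_vp, second_window_avg):
    # Passes iff the captured first voiceprint is more similar to the
    # reference than the historical average observed in the second window.
    return cosine_similarity(first_vp, reference_vp) > second_window_avg
```

Only the boolean outcome needs to be retained per utterance, which is what makes the storage saving noted above possible.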
In a possible implementation, updating the reference voiceprint according to the verification result of the first voiceprint in the first time window includes: updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window (that is, the pass rate of the first time window), where this ratio is the number of first voiceprints that pass verification in the first time window divided by the total number of first voiceprints in the first time window. In a possible implementation, updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window includes: updating the reference voiceprint when this ratio is less than or equal to a ratio threshold.

It should be understood that, in the above technical solution, the verification results of multiple first voiceprints can be obtained in the first time window, and the ratio of first voiceprints that pass verification can then be determined from those results. This ratio can indicate whether the first user's voiceprint has changed permanently, so the first user's reference voiceprint is updated, or left unchanged, according to it. This avoids mistakenly updating the reference voiceprint because of a single, possibly accidental, verification result, and helps identify more accurately whether the first user's voiceprint has changed permanently, so that the reference voiceprint is updated when the voiceprint has changed permanently and the original reference voiceprint is kept when it has not.
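The pass-rate decision can be sketched concretely; the ratio threshold value here is hypothetical, since the application does not specify one:

```python
def should_update_reference(results, ratio_threshold=0.5):
    # results: verification outcomes (True = passed) collected for the
    # first voiceprints in the first time window.
    if not results:
        return False  # no evidence in the window, keep the reference
    pass_ratio = sum(results) / len(results)
    # A low pass rate suggests the user's voiceprint has drifted
    # permanently away from the stored reference voiceprint.
    return pass_ratio <= ratio_threshold
```

Basing the decision on the whole window's ratio, rather than one result, is what guards against the accidental single-result update described above.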
In a possible implementation, updating the reference voiceprint includes: acquiring a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and updating the reference voiceprint according to the second voice signal.

It should be understood that, in the above technical solution, the similarity between the acquired second voice signal and the reference voiceprint meeting the preset condition can be understood as follows: during the period from the starting point of the first time window to the moment the second voice signal is acquired, the change in the first user's voiceprint is smaller than a change threshold (in other words, the first user's voiceprint is in a stable state, or has not changed permanently again), so the reference voiceprint can be updated according to this second voice signal.
In a possible implementation, updating the reference voiceprint according to the second voice signal includes: obtaining a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive text; and updating the reference voiceprint according to the deduplicated voice signal.

In a possible implementation, there is one second voice signal, and obtaining the deduplicated voice signal according to the second voice signal and its corresponding text includes: performing a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and obtaining the deduplicated voice signal according to the portion of the second voice signal that corresponds to the deduplicated text.

In a possible implementation, there are multiple second voice signals, and obtaining the deduplicated voice signal according to the second voice signals and their corresponding texts includes: for the i-th of the multiple second voice signals: performing a deduplication operation on the text corresponding to the i-th second voice signal against the historical deduplicated text; and splicing the text corresponding to the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the first through i-th second voice signals, where the historical deduplicated text is obtained from the texts corresponding to the first through (i-1)-th second voice signals, and i is greater than 1; and obtaining the deduplicated voice signal according to the portions of the multiple second voice signals that correspond to their joint deduplicated text.

It should be understood that, in the above technical solution, performing a deduplication operation on the second voice signal according to the signal and its corresponding text yields a deduplicated voice signal, which prevents high-frequency characters or words appearing many times in the user's second voice signal from affecting the accuracy of the extracted reference voiceprint.
In a possible implementation, the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.

It should be understood that, in the above technical solution, when the duration of the deduplicated voice signal is greater than the duration threshold, and/or the number of characters in its non-repetitive text is greater than the character-count threshold, the reference voiceprint is updated according to the deduplicated voice signal. A deduplicated voice signal of sufficient duration and/or covering a sufficient number of characters can thus be obtained, further improving the accuracy of the extracted reference voiceprint.
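A minimal sketch of the incremental deduplication described above, assuming the transcript of each second voice signal can be aligned character-by-character with a list of audio segments; that alignment, the segment representation, and the `min_chars` threshold are all illustrative assumptions, not details fixed by the application:

```python
def dedup_speech(signals, min_chars=0):
    # signals: list of (text, segments) pairs, one per second voice signal,
    # where segments[i] is the audio slice carrying character text[i].
    seen = []            # historical deduplicated text, in first-seen order
    kept_segments = []   # audio slices kept for voiceprint extraction
    for text, segments in signals:
        for ch, seg in zip(text, segments):
            if ch not in seen:   # drop characters already covered
                seen.append(ch)
                kept_segments.append(seg)
    dedup_text = ''.join(seen)
    # Only use the result once it covers enough distinct characters
    # (an analogous check would apply to total audio duration).
    if len(dedup_text) <= min_chars:
        return None
    return dedup_text, kept_segments
```

Processing signals in order reproduces the per-signal splice against the historical deduplicated text: each new signal contributes only the characters (and their audio) not yet seen.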
In a possible implementation, the method further includes: sliding the first time window, where the end point of the first time window after sliding is later than the end point of the first time window before sliding.

It should be understood that, in the above technical solution, the first time window can be slid, the verification results of the first voiceprints in the slid first time window can be acquired, and the ratio of first voiceprints that pass verification in the slid window (that is, the pass rate of the slid first time window) can be determined from those results. The end point of the first time window after sliding is later than the end point before sliding, and the interval between the two is shorter than the length of the first time window. In effect, the pass rate of the first time window is computed dynamically: at every fixed time interval, the pass rate over that time span (that is, the first time window) is determined, so that a permanent change in the first user's voiceprint can be detected in time and the first user's registered voiceprint updated.
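The dynamic recomputation can be sketched as a sliding window over timestamped verification results; the window length and stride are illustrative parameters, since the application fixes neither:

```python
def sliding_pass_ratios(events, window_len, stride):
    # events: (timestamp, passed) pairs sorted by time. Every `stride`
    # time units, report the pass ratio over the trailing window of
    # length `window_len`; windows containing no results are skipped.
    if not events:
        return []
    out = []
    end = events[0][0] + window_len
    while end <= events[-1][0] + stride:
        in_win = [p for t, p in events if end - window_len <= t < end]
        if in_win:
            out.append((end, sum(in_win) / len(in_win)))
        end += stride
    return out
```

Because the stride is shorter than the window, consecutive windows overlap, so a drop in the pass rate surfaces within one stride rather than one full window.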
In a possible implementation, the length of the first time window is variable.

It should be understood that, in the above technical solution, a first count threshold can be set; when the number of verification results of first voiceprints determined in the first time window is less than this threshold, the length of the first time window can be extended automatically until the number of determined verification results reaches the first count threshold. Basing the decision on verification results whose number reaches the first count threshold helps improve the accuracy of detecting a permanent change in the first user's voiceprint.
In a possible implementation, the method further includes: sliding the second time window, where the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.

In a possible implementation, the starting point of the second time window after sliding being earlier than the starting point of the first time window after sliding includes: sliding the second time window when the reference voiceprint is updated, where the starting point of the second time window after sliding is no earlier than the time point at which the reference voiceprint was updated.

In a possible implementation, the starting point of the second time window after sliding being earlier than the starting point of the first time window after sliding includes: sliding the second time window when the reference voiceprint is updated, where the starting point of the second time window after sliding is no earlier than the time point at which the second voice signal was acquired.

It should be understood that, in the above technical solution, when the reference voiceprint is updated, the second time window can be slid, the second voiceprints in the slid second time window acquired, and their similarities to the reference voiceprint determined, to obtain the average similarity of the slid second time window. Updating the average similarity of the second time window promptly after updating the reference voiceprint helps subsequently determine whether the first user's voiceprint undergoes another permanent change.
In a possible implementation, the length of the second time window is variable.

It should be understood that, in the above technical solution, a second count threshold can be set; when the number of similarities between second voiceprints and the reference voiceprint determined in the second time window is less than this threshold, the length of the second time window can be extended automatically until that number reaches the second count threshold. Basing the average on similarities whose number reaches the second count threshold helps improve the accuracy of the average similarity of the second time window.
In a possible implementation, before acquiring the verification result of the first voiceprint in the first time window, the method further includes: determining similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and determining the average similarity of the second time window according to the multiple similarities in the second time window.

It should be understood that, in the above technical solution, only the average similarity of the second time window needs to be stored, rather than the multiple second voiceprints in the second time window, which helps reduce the amount of storage. This average similarity is used to determine the verification result of the first voiceprint in the first time window, and is easy to implement.
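The storage saving noted above can be made concrete with a running mean: only a count and a mean are retained for the second time window, so the individual second voiceprints and their similarities can be discarded as they arrive (the class and member names are illustrative):

```python
class SecondWindowBaseline:
    # Maintains the average similarity of the second time window without
    # storing the individual second voiceprints or their similarities.
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, similarity):
        # Incremental (running) mean update: O(1) memory per window.
        self.count += 1
        self.mean += (similarity - self.mean) / self.count
```

After each update of the reference voiceprint, a fresh instance would be started for the slid second time window.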
In a second aspect, the present application provides a voiceprint management apparatus. The apparatus may be a terminal device, or a component (such as a processing apparatus, circuit, or chip) in a terminal device. The terminal device is, for example, an in-vehicle device (such as a head unit or in-vehicle computer) or a user terminal (such as a mobile phone or computer). The apparatus may also be a server, or a component (such as a processing apparatus, circuit, or chip) in a server. The server may include a physical device such as a host or a processor, may include a virtual device such as a virtual machine or a container, and may also include a chip or an integrated circuit. The server is, for example, an Internet-of-Vehicles (IoV) server, also called a cloud server, cloud, cloud side, cloud-side server, or cloud-side controller; the IoV server may be a single server or a server cluster composed of multiple servers, which is not specifically limited.

In a possible implementation, the voiceprint management apparatus provided by the present application includes an acquisition module and a processing module. The acquisition module is configured to acquire a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and the average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate the similarity between the voiceprint of the first user acquired in the second time window and the reference voiceprint; and the starting point of the first time window is later than the starting point of the second time window. The processing module is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
In a possible implementation, the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window includes: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that it passes verification.

In a possible implementation, the processing module is specifically configured to: update the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window.

In a possible implementation, the processing module is specifically configured to: update the reference voiceprint when the ratio of first voiceprints that pass verification in the first time window is less than or equal to a ratio threshold.

In a possible implementation, the processing module is specifically configured to: control the acquisition module to acquire a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and update the reference voiceprint according to the second voice signal.

In a possible implementation, the processing module is specifically configured to: obtain a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive text; and update the reference voiceprint according to the deduplicated voice signal.

In a possible implementation, there is one second voice signal, and the processing module is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and obtain the deduplicated voice signal according to the portion of the second voice signal that corresponds to the deduplicated text.

In a possible implementation, there are multiple second voice signals, and the processing module is specifically configured to: for the i-th of the multiple second voice signals: perform a deduplication operation on the text corresponding to the i-th second voice signal against the historical deduplicated text; and splice the text corresponding to the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the first through i-th second voice signals, where the historical deduplicated text is obtained from the texts corresponding to the first through (i-1)-th second voice signals, and i is greater than 1; and obtain the deduplicated voice signal according to the portions of the multiple second voice signals that correspond to their joint deduplicated text.

In a possible implementation, the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.

In a possible implementation, the processing module is further configured to: slide the first time window, where the end point of the first time window after sliding is later than the end point of the first time window before sliding.

In a possible implementation, the length of the first time window is variable.

In a possible implementation, the processing module is further configured to: slide the second time window, where the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.

In a possible implementation, before the acquisition module acquires the verification result of the first voiceprint in the first time window, the processing module is further configured to: determine similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and determine the average similarity of the second time window according to the multiple similarities in the second time window.
In a third aspect, the present application provides a computer program product. The computer program product includes a computer program or instructions which, when executed by a computing device, implement the method in the above first aspect or any possible implementation of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program or instructions which, when executed by a computing device, implement the method in the above first aspect or any possible implementation of the first aspect.

In a fifth aspect, the present application provides a computing device including a processor. The processor is connected to a memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device executes the method in the above first aspect or any possible implementation of the first aspect.

In a sixth aspect, an embodiment of the present application provides a chip system including a processor and a memory. The processor is coupled to the memory, and the memory is configured to store a program or instructions which, when executed by the processor, cause the chip system to implement the method in the above first aspect or any possible implementation of the first aspect.
Optionally, the chip system further includes an interface circuit, and the interface circuit is configured to exchange code instructions with the processor.
Optionally, there may be one or more processors in the chip system, and a processor may be implemented in hardware or in software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general-purpose processor that operates by reading software code stored in a memory.
Optionally, there may also be one or more memories in the chip system. A memory may be integrated with the processor, or may be disposed separately from the processor. For example, the memory may be a non-transitory memory such as a read-only memory (ROM); it may be integrated with the processor on the same chip, or the two may be disposed on different chips.
It should be understood that, in the technical solutions of the first aspect to the sixth aspect, the reference voiceprint of the first user is determined in advance; a second voiceprint of the first user is then obtained in the second time window, the similarity between the second voiceprint and the reference voiceprint is determined, and the average similarity of the second time window is determined according to a plurality of similarities in the second time window. Because the second time window lies after the time point at which the reference voiceprint is stored (that is, during the period from the time point at which the user registers the reference voiceprint to the end point of the second time window), the user's voiceprint changes only gradually, and a reference parameter (namely, the average similarity of the second time window with respect to the reference voiceprint) can be obtained based on this steadily changing voiceprint of the first user.
In the subsequent process of determining whether the user's voiceprint has changed permanently, the similarity between the first voiceprint and the reference voiceprint can be determined based on the user's first voiceprint in the first time window, and that similarity can be compared with the average similarity of the second time window to determine the verification result of the first voiceprint, that is, to determine whether the first voiceprint in the first time window is similar to the second voiceprints in the second time window, and further to determine whether the user's voiceprint has changed permanently during the period from the start point of the second time window to the end point of the first time window.
Moreover, in the present application, the verification results of a plurality of first voiceprints in the first time window are determined, and the ratio of first voiceprints that pass verification in the first time window (that is, the compliance rate of the first time window) is determined. Whether the user's voiceprint has changed permanently is determined according to the compliance rate of the first time window, which avoids a misjudgment caused by some accidental factor affecting the user within the first time window and helps improve the accuracy of the determination.
Further, when the reference voiceprint is not updated, the first time window can be slid, and the ratio of first voiceprints that pass verification in the slid first time window (that is, the compliance rate of the slid first time window) is determined. Whether the user's voiceprint has changed permanently is determined according to the compliance rate of the slid first time window, which helps detect in time that the voiceprint of the first user has changed permanently and update the registered voiceprint of the first user.
Further, when the reference voiceprint is updated, a second voice signal is obtained, and the voiceprint of the first user is determined according to the obtained second voice signal. No interaction with the user is required, so the update is imperceptible to the user, which helps improve the user experience. Moreover, the plurality of first voiceprints in the first time window does not need to be pre-stored, which helps reduce the amount of storage. In the present application, the plurality of second voiceprints in the second time window does not need to be stored either; only the average similarity of the second time window needs to be saved, which further reduces the amount of storage.
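The decision logic summarized above can be sketched as follows. All names and numeric values are illustrative assumptions: the text specifies comparing per-sample similarities with the second-window average and thresholding the compliance rate, but does not give the exact comparison rule, the tolerance margin, or the rate threshold used here.

```python
def cosine_similarity(a, b):
    # Similarity between two voiceprint feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) *
                  (sum(x * x for x in b) ** 0.5))

def compliance_rate(reference, first_window_voiceprints,
                    second_window_avg, margin=0.05):
    # Fraction of first voiceprints in the first time window whose
    # similarity to the reference voiceprint stays close to the average
    # similarity of the second time window (i.e. passes verification).
    passed = sum(
        1 for v in first_window_voiceprints
        if cosine_similarity(reference, v) >= second_window_avg - margin
    )
    return passed / len(first_window_voiceprints)

def voiceprint_changed_permanently(rate, rate_threshold=0.5):
    # A low compliance rate over the whole window, rather than a single
    # failed sample, is taken as evidence of a permanent change.
    return rate < rate_threshold
```

Deciding on the whole-window rate rather than on any individual sample is what protects against the accidental misjudgments mentioned above.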
Description of drawings
Fig. 1a is a schematic structural diagram of a voice signal processing system exemplarily provided by the present application;
Fig. 1b is a schematic flowchart of a semantic understanding process exemplarily provided by the present application;
Fig. 1c is a schematic flowchart of a voiceprint extraction process exemplarily provided by the present application;
Fig. 2 is a schematic diagram of a vehicle-mounted scenario exemplarily provided by the present application;
Fig. 3 is a schematic diagram of another vehicle-mounted scenario exemplarily provided by the present application;
Fig. 4 is a schematic diagram of a mobile phone interface display exemplarily provided by the present application;
Fig. 5 is a schematic diagram of a correspondence between a voiceprint management procedure and time exemplarily provided by the present application;
Fig. 6 is a schematic flowchart of voiceprint verification exemplarily provided by the present application;
Fig. 7 is a schematic diagram of another correspondence between a voiceprint management procedure and time exemplarily provided by the present application;
Fig. 8 is a schematic flowchart of updating a reference voiceprint exemplarily provided by the present application;
Fig. 9 is a schematic diagram of yet another vehicle-mounted scenario exemplarily provided by the present application;
Fig. 10 is a schematic flowchart of yet another voiceprint management method exemplarily provided by the present application;
Fig. 11 is a schematic structural diagram of a voiceprint management apparatus exemplarily provided by the present application;
Fig. 12 is a schematic structural diagram of another voiceprint management apparatus exemplarily provided by the present application.
Detailed description of embodiments
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1a is a schematic structural diagram of a voice signal processing system exemplarily provided by the present application. The voice signal processing system may include at least a voice collection device, a voiceprint management device, and a semantic understanding device.
For example, the voice signal processing system may collect a user's voice signal through the voice collection device and input the voice signal to the semantic understanding device and the voiceprint management device separately. The semantic understanding device may perform a semantic extraction procedure on the voice signal to obtain a machine-recognizable instruction corresponding to the user's voice signal; the voiceprint management device may determine the user's voiceprint feature vector according to the voice signal and perform a corresponding recognition process based on the voiceprint feature vector.
The voice collection device may be a microphone array, which may consist of a certain number of acoustic sensors (generally microphones). The microphone array may provide one or more of the following functions: speech enhancement, that is, extracting clean speech from a noisy voice signal; sound source localization, that is, using the microphone array to calculate the angle and distance of a target speaker, thereby enabling tracking of the target speaker and subsequent directional voice pickup; de-reverberation, that is, reducing the influence of reflected sound; and sound source signal extraction/separation, that is, extracting each of a plurality of mixed sounds. A microphone array is suitable for complex environments with heavy murmur, noise, and echo, such as in vehicles, outdoors, or in supermarkets.
Referring to the schematic flowchart of a semantic understanding process exemplarily shown in Fig. 1b, the semantic understanding device may perform the following processing on the voice signal in sequence:
(1) Automatic speech recognition (ASR): converting the voice signal input by the user into natural language text. In one possible manner, the voice signal may be processed as a sound wave. Specifically, the voice signal is divided into frames to obtain a short waveform segment corresponding to each frame, where the duration of each frame may be about 20 ms to 30 ms. The short waveform segment of each frame is converted into multi-dimensional vector information according to characteristics of the human ear. The multi-dimensional vector information is decoded to obtain a plurality of corresponding phonemes (phones), and the phonemes are assembled into words and concatenated into sentences (that is, natural language text) for output.
(2) Natural language processing (NLP): converting the meaningful parts of the natural language text into structured information that a machine can understand.
(3) Semantic slot filling: filling the structured information obtained by natural language processing into corresponding slots, so that the user's intention is converted into a machine-recognizable user instruction.
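The framing step in (1) above can be sketched as follows. This is illustrative only: the 25 ms frame length is one value in the 20 ms to 30 ms range mentioned above, and real ASR front ends typically use overlapping frames and window functions, which are omitted here.

```python
def frame_signal(samples, sample_rate, frame_ms=25):
    # Split a waveform into consecutive, non-overlapping frames of
    # roughly 20-30 ms each; each frame would later be converted into
    # multi-dimensional vector information and decoded into phonemes.
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

For example, at a 16 kHz sampling rate, one second of audio yields 40 frames of 400 samples each.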
Referring to the schematic flowchart of a voiceprint extraction process exemplarily shown in Fig. 1c, the voiceprint management device may include a voiceprint extraction module, which may be used to perform voiceprint extraction on a voice signal to obtain a corresponding voiceprint feature vector. For example, the voiceprint extraction module may sequentially perform preprocessing (also called front-end processing), voiceprint feature extraction, and post-processing on the voice signal.
(1) Preprocessing: extracting audio feature information from the voice signal. For example, in the process of extracting the audio feature information, one or more of the following operations may be performed on the voice signal: denoising, voice activity detection (VAD), perceptual linear prediction (PLP), and mel-frequency cepstral coefficients (MFCC) computation.
(2) Voiceprint feature extraction: the audio feature information is input into a voiceprint feature extraction model, and correspondingly, the voiceprint feature extraction model outputs a voiceprint feature vector. The voiceprint feature extraction model includes, but is not limited to, one or more of: a Gaussian mixture model (GMM), a joint factor analysis (JFA) model, an i-vector model, a d-vector model, and an x-vector model. In the present application, a voiceprint feature vector may be referred to simply as a voiceprint.
(3) Post-processing: the voiceprint output by the voiceprint feature extraction model is post-processed to obtain the final voiceprint. For example, the post-processing may include one or more of: linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), and nuisance attribute projection (NAP).
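The three-stage extraction pipeline above can be sketched as composable functions. Each stage body here is a deliberately simplified stand-in, not the real computation: an actual implementation would use VAD/MFCC for preprocessing, a trained GMM or x-vector model for extraction, and LDA/PLDA for post-processing.

```python
def preprocess(signal):
    # Stand-in for denoising/VAD/MFCC: just mean-normalize the samples.
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]

def extract_embedding(features, dim=4):
    # Stand-in for a GMM/i-vector/x-vector model: fold the feature
    # sequence into a fixed-length vector by strided summation.
    vec = [0.0] * dim
    for i, f in enumerate(features):
        vec[i % dim] += f
    return vec

def postprocess(embedding):
    # Stand-in for LDA/PLDA/NAP: length-normalize the embedding.
    norm = sum(x * x for x in embedding) ** 0.5 or 1.0
    return [x / norm for x in embedding]

def voiceprint(signal):
    # Preprocessing -> feature extraction -> post-processing, in the
    # order described for the voiceprint extraction module.
    return postprocess(extract_embedding(preprocess(signal)))
```

The point of the sketch is the staged structure: each stage can be swapped for a real model without changing the pipeline.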
As described above, the voiceprint extraction module can extract the corresponding voiceprint from a voice signal. Like a face, a fingerprint, or an iris, a voiceprint is a type of biometric information, and the identity of the user who produced the voice signal can be determined from it. Compared with face recognition, identifying a user by voiceprint is not limited by facial occlusion; compared with fingerprint recognition, it is not limited by physical contact; it is therefore highly practicable.
The voiceprint management device may further include a voiceprint recognition module, which may be used to recognize a user's identity according to a voiceprint. In a possible implementation, the voiceprint recognition module pre-stores the user's reference voiceprint, and may compare the voiceprint from the voiceprint extraction module (which may be called the collected voiceprint) with the reference voiceprint to determine whether the collected voiceprint and the reference voiceprint correspond to the same user.
In the present application, the voiceprint recognition module may determine, according to a similarity threshold, whether the collected voiceprint and the reference voiceprint correspond to the same user. The voiceprint recognition module may take the similarities between the registered voiceprints and the collected voiceprints of N users as samples and determine the similarity threshold from these samples, where N is a positive integer. In an optional manner, the voiceprint recognition module obtains, for each of the N users, the similarity between that user's registered voiceprint and collected voiceprint, yielding N similarities, and then determines the similarity threshold from the N similarities. For example, the voiceprint recognition module may take the average or the median of the N similarities as the similarity threshold; the similarity threshold may, for instance, be 0.75.
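The threshold derivation just described (mean or median of N sample similarities) can be sketched as follows; the function name and keyword argument are illustrative.

```python
import statistics

def similarity_threshold(sample_similarities, use_median=False):
    # sample_similarities: one registered-vs-collected voiceprint
    # similarity per user, N values in total.
    if use_median:
        return statistics.median(sample_similarities)
    return statistics.fmean(sample_similarities)
```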
In a first example, if the voiceprint recognition module determines that the similarity between the collected voiceprint and the reference voiceprint is greater than the similarity threshold, it may determine that the collected voiceprint and the reference voiceprint correspond to the same user. If it determines that the similarity is less than or equal to the similarity threshold, it may determine that the collected voiceprint and the reference voiceprint correspond to different users. In the present application, a similarity between the collected voiceprint and the reference voiceprint that is greater than the similarity threshold may be understood as the two matching, and a similarity that is less than or equal to the similarity threshold may be understood as the two not matching.
In a second example, the voiceprint recognition module may instead determine that the collected voiceprint and the reference voiceprint correspond to the same user when the similarity between them is greater than or equal to the similarity threshold, and determine that they correspond to different users when the similarity is less than the similarity threshold. Correspondingly, in the present application, a similarity greater than or equal to the similarity threshold may be understood as the collected voiceprint and the reference voiceprint matching, and a similarity less than the similarity threshold may be understood as the two not matching.
For ease of description, the first example is used as an illustration below.
It should be added that, when the similarities between the collected voiceprint and each of a plurality of pre-registered reference voiceprints are all greater than the similarity threshold, the reference voiceprint with the greatest similarity to the collected voiceprint among the plurality of reference voiceprints may be taken as the matching voiceprint of the collected voiceprint. That is, it is determined that, among the plurality of reference voiceprints, the reference voiceprint with the greatest similarity to the collected voiceprint matches the collected voiceprint, while the other reference voiceprints do not match the collected voiceprint.
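Combining the first example with this best-match rule, the matching logic can be sketched as follows. The function names are hypothetical, cosine similarity is an assumed measure, and ties between references are resolved by the first one seen.

```python
def match_reference(collected, references, threshold=0.75):
    # references maps user name -> enrolled reference voiceprint.
    # Per the first example: a reference matches only if its similarity
    # strictly exceeds the threshold; when several exceed it, the single
    # most similar reference wins and all others are treated as non-matches.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    best_user, best_sim = None, threshold
    for user, ref in references.items():
        sim = cosine(collected, ref)
        if sim > best_sim:
            best_user, best_sim = user, sim
    return best_user  # None means no enrolled user matches
```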
Voiceprint recognition may be used in an identity recognition process. For example, the voiceprint recognition module stores reference voiceprints of one or more users; it may compare the collected voiceprint with the stored reference voiceprints, determine from them the reference voiceprint that matches the currently collected voiceprint, and then determine the user's identity according to the determined reference voiceprint. Further, the voiceprint recognition module may store different permissions corresponding to different users; after determining a user's identity according to the collected voiceprint, it may further determine, according to that identity, the permissions corresponding to the user.
Voiceprint recognition may also be used in an identity verification process. For example, the voiceprint recognition module stores reference voiceprints of one or more users who have passed identity verification; it may compare the collected voiceprint with the stored reference voiceprints to determine whether any reference voiceprint matches the collected voiceprint. If so, it may determine that the user corresponding to the currently collected voiceprint passes identity verification; otherwise, it may determine that the user does not pass identity verification.
To help understand this solution, the voiceprint recognition procedure of the present application is explained below with reference to usage scenarios. For example, in one usage scenario the voiceprint management device may be a vehicle-mounted device (for example, a head unit or a vehicle-mounted computer), and the vehicle-mounted device may determine, based on the reference voiceprint, the identity of the user corresponding to the currently collected voiceprint.
In one example, the vehicle-mounted device may determine, based on the reference voiceprint, whether a user passes identity verification. Specifically, suppose the vehicle-mounted device provides the function of querying vehicle violation information only to the vehicle owner. The vehicle-mounted device may store a reference voiceprint a of user A (the vehicle owner) and mark that reference voiceprint a corresponds to the owner. In the scenario shown in (a) of Fig. 2, when user A wants to query vehicle violation information, user A may say "query vehicle violation information". The vehicle-mounted device may extract the voiceprint from the voice signal "query vehicle violation information", determine that the extracted voiceprint matches reference voiceprint a, and, in response to the voice signal, display the current violation information on the display, for example, "April 20, 2021, ran a red light, 6 points deducted". In the scenario shown in (b) of Fig. 2, when user B (not the owner) queries vehicle violation information, the vehicle-mounted device may extract the voiceprint from user B's voice signal "query vehicle violation information", determine that the extracted voiceprint does not match reference voiceprint a, and prompt user B that the query failed, for example, by displaying "query limited to vehicle owner" on the display interface.
The above example can also be understood as follows: after determining a user's identity, the vehicle-mounted device may determine, based on different user identities, the different permissions corresponding to different users. Specifically, the owner has the permission to query vehicle violation information, while a non-owner does not. When the vehicle-mounted device determines that the user is the owner, it provides the user with the function of querying vehicle violation information; when it determines that the user is not the owner, it refuses to provide that function. In another example of determining different permissions according to different user identities, a reference voiceprint corresponding to the driver may be set in the vehicle-mounted device. When the user voiceprint collected by the vehicle-mounted device matches the driver's reference voiceprint, the vehicle-mounted device may determine that the current user is the driver and, correspondingly, grant the current user the permissions corresponding to the driver, for example, permitting control of the vehicle's driving by voice signal.
In yet another example, the vehicle-mounted device provides different recommended content to different users. Specifically, the vehicle-mounted device may store a reference voiceprint a of user A and record that user A likes rock music, and store a reference voiceprint b of user B and record that user B likes light music. In the scenario shown in (a) of Fig. 3, when user A says "turn on the music", the vehicle-mounted device may extract the voiceprint from the voice signal "turn on the music", compare the extracted voiceprint with the stored reference voiceprints a and b, determine that the extracted voiceprint matches reference voiceprint a, and recommend a rock music list on the display interface. In the scenario shown in (b) of Fig. 3, when user B says "turn on the music", the vehicle-mounted device may extract the voiceprint from the voice signal "turn on the music", compare the extracted voiceprint with the stored reference voiceprints a and b, determine that the extracted voiceprint matches reference voiceprint b, and recommend a light music list on the display interface.
Of course, in this example, when the vehicle-mounted device determines that the extracted voiceprint matches reference voiceprint a, it may also directly play rock music; or, when it determines that the extracted voiceprint matches reference voiceprint b, it may directly play light music. Other implementations are also possible, which are not limited in the present application.
In combination with the above determination by the voiceprint management device of different permissions for different user identities, another usage scenario is exemplarily provided, in which the voiceprint management device may be a user terminal, such as a mobile phone. The reference voiceprint of the phone's owner is pre-stored in the mobile phone, and the mobile phone can determine whether the collected voiceprint matches the reference voiceprint, and thereby whether the user corresponding to the currently collected voiceprint is the owner. When the mobile phone is in the locked-screen state, the owner can instruct the phone by voice signal to unlock and perform a corresponding action. For example, in the scenario shown in Fig. 4, when the owner wants to use the photo album application on the phone, the owner may say "open the photo album application" to the locked phone, whose display interface is shown in (a) of Fig. 4. The phone obtains the voice signal "open the photo album application" and extracts the voiceprint; when the phone determines that the extracted voiceprint matches the reference voiceprint, it may, in response to the voice signal, open the photo album application, with the resulting display interface shown in (b) of Fig. 4.
In the above scenarios, the voiceprint management device needs to compare the collected voiceprint against a reasonably accurate reference voiceprint and then perform the corresponding action based on the comparison result (match or no match). That is, the accuracy of the reference voiceprint stored in the voiceprint management device affects the accuracy of voiceprint recognition.
A user's voiceprint may undergo a short-term change or a permanent change. A short-term change is a reversible change in the user's voiceprint caused by a temporary external stimulus, for example, a reversible change caused by the user catching a cold. A permanent change is an irreversible change in the user's voiceprint caused by a physiological change in the user. The voiceprint management device needs to update the user's reference voiceprint based on a user voiceprint that has changed permanently, but not based on a user voiceprint that has changed only for a short time, thereby improving the accuracy of the reference voiceprint and, in turn, the accuracy of voiceprint recognition.
Therefore, it is necessary to distinguish relatively accurately whether the user's voiceprint has changed and whether the change is short-term or permanent, and then to update the user's reference voiceprint based on a user voiceprint that has changed permanently.
This application provides, by way of example, a voiceprint management method that can be executed by a voiceprint management apparatus. Exemplarily, the voiceprint management apparatus may be the voiceprint management device shown in Figure 1a.
Exemplarily, the voiceprint management apparatus may be a terminal device, or a component in a terminal device (such as a processing device, a circuit, or a chip). The terminal device may be, for example, a vehicle-mounted device (such as a head unit or an on-board computer) or a user terminal (such as a mobile phone or a computer).
As another example, the voiceprint management apparatus may be a server, or a component in a server (such as a processing device, a circuit, or a chip). The server may include a physical device such as a host or a processor, a virtual device such as a virtual machine or a container, or a chip or an integrated circuit. The server may be, for example, an Internet-of-Vehicles server, also called a cloud server, the cloud, or a cloud controller; it may be a single server or a server cluster composed of multiple servers, which is not specifically limited.
In the voiceprint management method provided by this application, it can be determined, based on the reference voiceprint and a reference similarity, whether a voiceprint collected in the current time window passes verification, and thereby whether the user's reference voiceprint needs to be updated.
To better explain the embodiments of this application, the three procedures of the voiceprint management method, namely obtaining the reference voiceprint, obtaining the reference similarity, and voiceprint verification, are explained in turn below with reference to the correspondence between the voiceprint management procedures and time shown in Figure 5.
It should be noted in advance that the three procedures below all concern the same user (also called the registrant or speaker): the user's reference voiceprint and reference similarity are obtained, the user's voiceprint is verified, and it is determined whether the user's reference voiceprint needs to be updated. In a specific implementation, that it is the same user can be determined based on technologies such as voiceprint comparison and face recognition.
In voiceprint comparison, a similarity threshold can be used to indicate whether two voiceprints correspond to the same user. That is, regardless of whether a user's voiceprint has changed, the similarity between that user's collected voiceprint and his or her reference voiceprint is greater than the similarity threshold, while the similarity between the voiceprints of different users is less than the similarity threshold. For example, suppose the reference voiceprint registered by user A is reference voiceprint a, the reference voiceprint registered by user B is reference voiceprint b, and the similarity between reference voiceprints a and b is less than the similarity threshold. When user A speaks, the voiceprint determined from the collected voice signal has a similarity to reference voiceprint a that is greater than the similarity threshold, and a similarity to reference voiceprint b that is less than the similarity threshold; it is therefore determined that the collected voiceprint corresponds to user A, not user B.
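The threshold comparison in the user A / user B example above can be sketched minimally as follows. The function name, the two-user setup, and the threshold value 0.7 are illustrative assumptions, not values given in this application:

```python
def match_user(sim_to_a, sim_to_b, threshold=0.7):
    """Attribute a collected voiceprint to the registered user whose
    reference voiceprint it exceeds the similarity threshold against."""
    if sim_to_a > threshold and sim_to_b <= threshold:
        return "A"
    if sim_to_b > threshold and sim_to_a <= threshold:
        return "B"
    return None  # ambiguous or no registered user matched
```

With user A speaking, a similarity of, say, 0.9 to reference voiceprint a and 0.3 to reference voiceprint b yields `"A"`.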
In face recognition, a face image can be used to determine that the user is speaking, in which case the collected voice signal is the user's voice signal, and the voiceprint extracted from that voice signal is the user's voiceprint. Whether the user is speaking can be determined from the user's mouth shape, which opens and/or closes according to a preset pattern; if a voice signal corresponding to that pattern is collected, it can be determined that the user is currently speaking and that the acquired voice signal was issued by the user.
In this application, the user may also be identified in other ways. For example, a user account may be set up: when the user issues a voice signal after logging in through the account, it can be determined that the acquired voice signal and the user who logged in to the account correspond to the same user. The account may also be replaced by the user's fingerprint: when the user issues a voice signal after logging in with a fingerprint, it can be determined that the acquired voice signal and the user who logged in with the fingerprint correspond to the same user.
In this application, one or more of the above techniques, that is, voiceprint comparison, face recognition, and account verification, may also be combined to determine that it is the same user, so as to improve the accuracy of that determination.
For convenience of description, in the following embodiments, this same user is referred to as the first user.
1. Procedure for obtaining the reference voiceprint:
The first user can register a voiceprint at a registration time point (t0 in (a) of Figure 5); the voiceprint registered by the first user may be called the registered voiceprint or the reference voiceprint. In one possible implementation, a passage of preset text is shown on a display interface and the first user reads it aloud; the first user's current voice signal is acquired, a voiceprint is extracted from that voice signal to obtain the first user's reference voiceprint, and the reference voiceprint is stored.
2. Procedure for obtaining the reference similarity:
Since the reference voiceprint is extracted from a single segment of the first user's voice signal, to avoid the reference voiceprint being inaccurate for incidental reasons, this application can further obtain a reference similarity over a second preset duration after the registration time point. The reference similarity and the reference voiceprint can together serve as reference parameters for determining whether the reference voiceprint needs to be updated; for details, see the description of the voiceprint verification procedure below.
In this application, the period of the second preset duration after the registration time point may also be called the reference time period, the reference time window, the second time window, and so on. The second time window can perform a sliding operation under specific circumstances, and its actual duration is variable.
In the specific acquisition process, the first user's voice signal in the second time window (t0-t1 in (a) of Figure 5) can be acquired, and a voiceprint extracted from it; this voiceprint is the first user's voiceprint collected in the second time window (and may be called a second voiceprint). The second voiceprint can be compared with the reference voiceprint; specifically, the similarity between the second voiceprint and the reference voiceprint can be determined according to a similarity algorithm (such as a cosine similarity algorithm or a Euclidean distance algorithm).
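As a minimal sketch of the cosine-similarity option mentioned above, treating each voiceprint as a feature vector (the two-dimensional example vectors are illustrative only):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two voiceprint feature vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

reference_vp = [0.6, 0.8]   # hypothetical reference voiceprint vector
second_vp = [0.8, 0.6]      # hypothetical second voiceprint vector
sim = cosine_similarity(reference_vp, second_vp)
```

In practice the vectors would be embeddings produced by the voiceprint extraction procedure (Figure 1c), typically with hundreds of dimensions.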
Further, multiple second voiceprints can be collected in the second time window, that is, the similarity between each second voiceprint and the reference voiceprint can be determined. In one example, the average of these similarities can be used as the reference similarity corresponding to the second time window (and may be called the average similarity). Of course, in this application, the median of the similarities, or the mean of the maximum and minimum values, or some other statistic, may also be used as the reference similarity corresponding to the second time window. In what follows, the average similarity corresponding to the second time window is used as an example; "average similarity" may of course be replaced by "reference similarity" with the same meaning.
Here, to improve the accuracy of the average similarity of the second time window, a second count threshold can be preset. When it is determined that the number of second voiceprints in the second time window of the second preset duration is less than this second count threshold, the duration of the second time window can be automatically extended until the number of second voiceprints in the second time window reaches the second count threshold.
Exemplarily, suppose the second count threshold is 10, but only 8 second voiceprints are collected within t0-t1 shown in (a) of Figure 5. To improve the accuracy of the average similarity of the second time window, its duration can be automatically extended; if, for example, the 10th second voiceprint is not collected until t1', then the end point of the second time window is determined to be t1', that is, the second time window is t0-t1'. Further, the average similarity of the second time window can be determined based on the similarities between the 10 second voiceprints and the reference voiceprint.
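The averaging-with-extension logic above can be sketched as follows; returning `None` stands in for "keep the window open and collect more voiceprints" (the function name and the default threshold of 10 mirror the example, but the interface is an assumption):

```python
def reference_similarity(similarities, count_threshold=10):
    """Average similarity for the second time window. Returns None while
    fewer voiceprints than the second count threshold have been collected,
    signalling that the window's duration must be extended."""
    if len(similarities) < count_threshold:
        return None  # window must be extended (e.g. t0-t1 -> t0-t1')
    return sum(similarities) / len(similarities)
```

With only 8 similarities the caller extends the window; once 10 are available, the average becomes the reference similarity.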
3. Voiceprint verification procedure:
In this application, multiple voiceprints of the first user can be acquired in a verification time period (which may be called the first time window), and based on the acquired voiceprints it is determined whether the first user's voiceprint has undergone a long-term change, that is, whether the first user's reference voiceprint needs to be updated based on the voiceprint after the long-term change.
In this application, the starting point of the first time window is later than that of the second time window; the first time window may partially overlap the second time window, or not overlap it at all. The first time window can be set to a first preset duration, which may or may not equal the second preset duration. Further, the first time window can perform a sliding operation under specific circumstances, and its actual duration is variable.
Exemplarily, Figure 6 shows a schematic flowchart of voiceprint verification, in which:
Step 601: acquire a first voice signal in the first time window.
The first voice signal of the first user in the first time window (t1-t2 in (a) of Figure 5) can be acquired, and the first user's first voiceprint extracted from it according to the voiceprint extraction procedure shown in Figure 1c.
Step 602: extract the first voiceprint from the first voice signal, and determine the verification result of the first voiceprint.
The first voiceprint can be compared with the reference voiceprint to determine its verification result. Specifically, the similarity between the first voiceprint and the reference voiceprint can be determined: when that similarity is greater than the average similarity corresponding to the second time window, it is recorded that the first voiceprint passed verification; when it is not greater, it is recorded that the first voiceprint failed verification.
In an optional manner, one bit can be used to record whether the corresponding first voiceprint passed verification; for example, when the first voiceprint passed verification, the corresponding bit is set to 1, and when it failed, the corresponding bit is set to 0. With this technical solution, the multiple first voiceprints themselves need not be stored; only their verification results, each occupying one bit, need to be stored, which helps reduce storage space.
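The one-bit-per-result scheme can be sketched by packing the results into an integer bitmap (the packing layout, with bit i holding the i-th result, is an illustrative choice):

```python
def pack_results(similarities, avg_similarity):
    """Pack pass/fail verification results into an integer bitmap,
    one bit per first voiceprint: 1 = passed (similarity above the
    second time window's average similarity), 0 = failed."""
    bitmap = 0
    for i, sim in enumerate(similarities):
        if sim > avg_similarity:
            bitmap |= 1 << i
    return bitmap
```

Only the bitmap is stored, not the voiceprints: five results fit in five bits rather than five stored feature vectors.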
Step 603: determine, according to the verification results of the first voiceprints in the first time window, whether the reference voiceprint needs to be updated. If no update is required, the voiceprint verification procedure can be executed again. If an update is required, the voiceprint update procedure can be executed.
Multiple first voiceprints can be collected in the first time window, that is, the verification results of multiple first voiceprints can be determined. Based on these verification results, it can be determined whether to update the first user's reference voiceprint.
In an optional example, the ratio of first voiceprints that passed verification (which may be called the pass rate) can be computed, that is, the ratio of the number of first voiceprints that passed verification to the total number of first voiceprints. If this ratio is greater than a ratio threshold, it indicates that the first user's voiceprint has not undergone a long-term change, and the reference voiceprint does not need to be updated. If the ratio is less than or equal to the ratio threshold, it indicates that the first user's voiceprint has undergone a long-term change, and the reference voiceprint needs to be updated.
For example, suppose the average similarity corresponding to the second time window is 0.85 and the ratio threshold is 70%. If a total of 5 first voiceprints are acquired in the first time window, and the similarities between these 5 first voiceprints and the reference voiceprint are, say, 0.90, 0.90, 0.90, 0.80, and 0.86, then the verification results of the 5 first voiceprints are 1, 1, 1, 0, and 1 respectively. The ratio of first voiceprints that passed verification in the first time window is therefore 80%, which is greater than the 70% ratio threshold, so the first user's reference voiceprint does not need to be updated.
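The pass-rate decision in this example can be sketched as follows (the default ratio threshold of 0.70 matches the example; the function interface is an assumption):

```python
def needs_update(results, ratio_threshold=0.70):
    """Decide from the verification results (1 = passed, 0 = failed) of
    the first time window whether the reference voiceprint needs updating:
    a pass rate at or below the ratio threshold indicates a long-term
    change in the first user's voiceprint."""
    pass_rate = sum(results) / len(results)
    return pass_rate <= ratio_threshold
```

For the results 1, 1, 1, 0, 1 above the pass rate is 80%, so no update is triggered.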
Here, to improve verification accuracy, a first count threshold can be preset. When it is determined that the number of verification results of first voiceprints in the first time window of the first preset duration is less than this first count threshold, the duration of the first time window can be automatically extended until the number of verification results reaches the first count threshold. Exemplarily, suppose the first count threshold is 10, but only 8 first voiceprints are collected within t1-t2 shown in (a) of Figure 5; the duration of the first time window can then be automatically extended. If, for example, the 10th first voiceprint is not collected until t2', then the end point of the first time window is determined to be t2', that is, the first time window is t1-t2'. Further, whether to update the reference voiceprint can be determined based on the verification results of the 10 first voiceprints.
In this step, the average of the similarities between the multiple first voiceprints in the first time window and the reference voiceprint (which may be called the average similarity of the first time window) can also be determined. The average similarity of the first time window indicates how similar the first user's voiceprints acquired in the first time window are to the reference voiceprint, and can be used in the subsequent update of the reference voiceprint; for details, see the embodiments below.
Step 604: slide the first time window.
In an optional manner, the first time window can be slid backward by a third preset duration, which may be shorter than the first preset duration or shorter than the second preset duration. The end point of the first time window after sliding is later than its end point before sliding. Exemplarily, referring to (a) and (b) of Figure 5, the first time window slides from t1-t2 to t3-t4, where the interval between t2 and t4 is the third preset duration. In one specific implementation, the first preset duration may be 7 days, the second preset duration 7 days, and the third preset duration 1 day.
Since, in some embodiments, the first time window may be extended until the number of verification results of first voiceprints in it reaches the first count threshold, that is, the duration of the first time window is variable, the duration of the first time window before sliding may be greater than the first preset duration. In that case, the end point of the first time window can be slid backward by the third preset duration, and the starting point of the slid first time window then determined from its new end point and the first preset duration.
For example, with reference to Figure 7: suppose that within the period t1-t2 (where t1 and t2 are separated by the first preset duration), the number of verification results of first voiceprints in the first time window does not reach the first count threshold, so the end point of the first time window is moved back to t2' (where the interval between t1 and t2' is greater than the first preset duration). When sliding the first time window, the time point t2' + the third preset duration can be taken as the end point of the slid window, denoted t4'. The starting point of the slid window is then determined to be t4' - the first preset duration, denoted t3'.
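The sliding arithmetic just described can be sketched in a few lines; the day-based numbers below (an extended window ending on day 9, a 7-day first preset duration, a 1-day third preset duration) are illustrative:

```python
def slide_first_window(current_end, first_preset, third_preset):
    """Slide the first time window as in Figure 7:
    new end point   = old end point + third preset duration   (t4')
    new start point = new end point - first preset duration   (t3')."""
    new_end = current_end + third_preset
    new_start = new_end - first_preset
    return new_start, new_end
```

If the extended window ends at day 9 (t2'), sliding with a 1-day step and a 7-day window gives a new window spanning days 3 to 10.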
Further, first voiceprints of the first user can continue to be collected in the slid first time window; the similarity between each collected first voiceprint and the reference voiceprint is determined, and from it the voiceprint's verification result. In the slid first time window, the verification results of multiple first voiceprints can be determined, and from them whether the first user's reference voiceprint needs to be updated. That is, after step 604, steps 601 to 603 above can continue to be executed until the first user's reference voiceprint is updated.
Exemplarily, referring to (a) to (c) of Figure 5, the first time window may slide gradually from t1-t2 to t5-t6. Further, when the first time window has slid to t5-t6, it is determined from the verification results of the multiple first voiceprints acquired in t5-t6 that the voiceprint update procedure needs to be executed; that is, the voiceprint update procedure of steps 605 to 606 can be started at t6.
Step 605: acquire a second voice signal, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold.
After the voiceprint update procedure is started, the first user's voice signal can be collected and it can be determined whether this voice signal meets a preset condition. If it does, the voice signal is used to update the reference voiceprint; otherwise, it is discarded.
Exemplarily, after the voiceprint update procedure is started, a voiceprint is determined from the collected voice signal, and the similarity between this voiceprint and the reference voiceprint is determined. If the difference between this similarity and the average similarity of the first time window is less than the difference threshold, the collected voice signal (that is, the second voice signal) can be used to update the reference voiceprint; if not, the collected voice signal can be discarded. Exemplarily, the difference threshold is 0.1.
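The acceptance test above can be sketched as follows, taking the difference as an absolute difference (an interpretation, since the text does not specify the sign) and the 0.1 threshold from the example:

```python
def usable_for_update(candidate_similarity, window_avg_similarity,
                      diff_threshold=0.1):
    """Keep a collected voice signal for updating the reference voiceprint
    only if its similarity to the reference voiceprint is close to the
    first time window's average similarity; otherwise it is discarded."""
    return abs(candidate_similarity - window_avg_similarity) < diff_threshold
```

With a window average of 0.80, a candidate at 0.78 is kept while one at 0.60 is discarded.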
Step 606: update the reference voiceprint according to the second voice signal.
In this application, one or more second voice signals may be acquired after the voiceprint update procedure is started. A deduplication operation, or a deduplication operation plus a splicing operation, can be performed on the one or more second voice signals to obtain the resulting voice signal, from which a voiceprint is then extracted.
In this application, it is considered that when the first user issues a voice signal, high-frequency words such as wake-up words may appear in it, and excessive occurrences of such words may affect the accuracy of the voiceprint extracted from the voice signal. Therefore, a deduplication operation needs to be performed on the second voice signal. Specifically, ASR can be performed on the second voice signal to obtain its corresponding text, and a deduplication operation performed on that text to obtain text containing no repeated text (which may be called deduplicated text). Further, based on the deduplicated text and the second voice signal, the voice signal after the deduplication operation (which may be called the deduplicated voice signal) is obtained.
Further, to improve the accuracy of the updated voiceprint, an update condition can be preset; when the deduplicated voice signal meets this condition, the reference voiceprint can be updated from it. Exemplarily, the update condition may be that the duration of the deduplicated voice signal is greater than a duration threshold, and/or that the number of characters in the non-repeated text corresponding to the deduplicated voice signal (also called the deduplicated text) is greater than a character-count threshold. For convenience of description, the following examples all use the update condition that the duration of the deduplicated voice signal is greater than the duration threshold and the number of characters in its non-repeated text is greater than the character-count threshold.
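The conjunctive update condition used in the examples can be sketched as follows; the actual threshold values (30 seconds, 50 characters) are illustrative assumptions, as the application does not fix them:

```python
def meets_update_condition(dedup_duration_s, dedup_char_count,
                           duration_threshold_s=30.0, char_threshold=50):
    """Update condition used in the examples: the deduplicated voice
    signal must be long enough AND its deduplicated text must contain
    enough characters before the reference voiceprint is updated."""
    return (dedup_duration_s > duration_threshold_s
            and dedup_char_count > char_threshold)
```

A deduplicated signal of 45 seconds with 80 characters would trigger the update; one with only 20 characters would not, and more second voice signals would be collected.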
Exemplarily, as shown in (d) of Figure 5, the procedure for updating the reference voiceprint is started at t6; one or more second voice signals are then collected until, at t7, a deduplicated voice signal meeting the update condition is obtained, and the reference voiceprint is updated from it at t7. The period t6-t7 can be understood as a third time window, used to acquire the one or more second voice signals.
Steps 605 and 606 above are explained below with reference to the schematic flowchart of updating the reference voiceprint shown by way of example in Figure 8.
Step 801: acquire the 1st second voice signal.
Step 802: determine, from the 1st second voice signal, its corresponding deduplicated voice signal.
Specifically, ASR is performed on the 1st second voice signal to obtain its corresponding text, and a deduplication operation is performed on that text to obtain the deduplicated text corresponding to the 1st second voice signal (which may be called the first deduplicated text).
Exemplarily, suppose the first user wants to wake up a device whose wake-up word is "小A"; the first user can then issue the second voice signal "小A小A". After this second voice signal is collected, ASR processing is performed on it to obtain its corresponding text, "小A小A". Further, once it is determined that "小A小A" contains the repeated text "小A", the deduplication operation can be performed, and the resulting first deduplicated text is "小A". Similarly, if the first user issues the second voice signal "小A, 小A, please turn on the air conditioner", the first deduplicated text obtained from it is "小A, please turn on the air conditioner".
Of course, the text corresponding to the second voice signal may contain no repeated text; for example, the first user may issue the second voice signal "Hello, 小A", or "小A, please turn on the air conditioner", or "小A, please play some music". In that case, after ASR has been performed on the second voice signal and it is determined that its corresponding text contains no repeated text, that text is used directly as the first deduplicated text.
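The text deduplication step can be sketched as follows, assuming the ASR output has already been segmented into phrases (the segmentation and the ASCII stand-in "xiao A" for the wake-up word are assumptions for illustration):

```python
def dedup_phrases(phrases):
    """Keep the first occurrence of each phrase in the ASR transcript;
    later repeats (e.g. a wake-up word said twice) are dropped."""
    seen = set()
    kept = []
    for phrase in phrases:
        if phrase not in seen:
            seen.add(phrase)
            kept.append(phrase)
    return kept
```

For the transcript "xiao A / xiao A / please turn on the air conditioner" this yields the deduplicated text "xiao A / please turn on the air conditioner"; a transcript with no repeats passes through unchanged.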
Further, the voice signal corresponding to the first deduplicated text within the 1st second voice signal can be determined, so as to obtain the voice signal resulting from the deduplication operation on the 1st second voice signal (that is, the first deduplicated voice signal).
Taking the above 1st second voice signal "小A, 小A, please turn on the air conditioner" as an example, suppose the correspondence between this text and the voice signal segments of the 1st second voice signal is as shown in Table 1.
Table 1

Text                                 | Speech signal segment
"Little A" (first)                   | Speech signal segment 11
"Little A" (second)                  | Speech signal segment 12
"please turn on the air conditioner" | Speech signal segment 13
After the first deduplicated text is determined to be "Little A, please turn on the air conditioner", the first deduplicated speech signal can be determined from the speech signal segments that the first deduplicated text corresponds to in the 1st second voice signal. With reference to Table 1, speech signal segment 11, corresponding to "Little A" (the first), is spliced with speech signal segment 13, corresponding to "please turn on the air conditioner", to obtain the first deduplicated speech signal.
It should be noted that, in this example, since the speech signal segment corresponding to "Little A" in the first deduplicated text may be either speech signal segment 11 or speech signal segment 12, the first deduplicated speech signal may be obtained either by splicing segment 11 with segment 13 or by splicing segment 12 with segment 13.
It should also be noted that the above merely illustrates how the deduplicated speech signal is obtained from the second voice signal according to the deduplicated text. When the second voice signal is divided according to text, the division may also be as shown in Table 2. Of course, the second voice signal may also be divided in other ways, which are not limited in this application.
Table 2

Text                  | Speech signal segment
"Little A" (first)    | Speech signal segment 11
"Little A" (second)   | Speech signal segment 12
"please turn on"      | Speech signal segment 14
"the air conditioner" | Speech signal segment 15
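The segment selection and splicing step of Table 1 can be sketched as follows, modelling each segment by its identifier. Keeping the first of the two "小A" segments is one of the two choices the text allows; keeping the second would be equally valid.

```python
def splice_deduplicated(pieces):
    # pieces: ordered (text, segment_id) pairs from the ASR alignment.
    # Keep one segment per distinct text piece and splice them in order.
    seen, kept = set(), []
    for text, segment_id in pieces:
        if text in seen:
            continue          # drop the duplicate piece, e.g. the second "小A"
        seen.add(text)
        kept.append(segment_id)
    return kept

table1 = [("小A", 11), ("小A", 12), ("请打开空调", 13)]
print(splice_deduplicated(table1))   # [11, 13]
```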
Step 803: if it is determined that the first deduplicated speech signal meets the update condition, perform step 807; if it is determined that the first deduplicated speech signal does not meet the update condition, perform step 804.
Step 804: acquire the 2nd second voice signal.
Step 805: according to the 2nd second voice signal and the first deduplicated text, determine the deduplicated speech signal jointly corresponding to the 1st and 2nd second voice signals.
It should be noted that, since the 1st second voice signal was acquired before the 2nd second voice signal, and the first deduplicated text was obtained by processing the 1st second voice signal, the first deduplicated text can serve as the historical deduplicated text for the 2nd second voice signal. After ASR processing is performed on the 2nd second voice signal to obtain its corresponding text, a deduplication operation is performed on that text according to the historical deduplicated text (the first deduplicated text), yielding the deduplicated text jointly corresponding to the 1st and 2nd second voice signals (which may be called the second deduplicated text).
In one optional manner, a deduplication operation is first performed on the text corresponding to the 2nd second voice signal; the result is then compared with the first deduplicated text, the portions that duplicate the first deduplicated text are deleted, and the remaining text is spliced onto the first deduplicated text to obtain the second deduplicated text.
For example, suppose the first deduplicated text is "Little A, please turn on the air conditioner" and the 2nd second voice signal is "Little A, Little A, turn the air conditioner up". Performing ASR on the 2nd second voice signal yields the text "Little A Little A turn the air conditioner up"; after the deduplication operation, the text "Little A turn the air conditioner up" is obtained. Then, from the first deduplicated text "Little A, please turn on the air conditioner" and "Little A turn the air conditioner up", the second deduplicated text "Little A, please turn the air conditioner up" is obtained.
In another optional manner, the text corresponding to the 2nd second voice signal is first spliced onto the first deduplicated text, and the deduplication operation is then performed on the spliced text to obtain the second deduplicated text.
For example, with the same first deduplicated text and 2nd second voice signal, splicing the text "Little A Little A turn the air conditioner up" onto "Little A, please turn on the air conditioner" gives "Little A please turn on the air conditioner Little A Little A turn the air conditioner up"; performing the deduplication operation then yields "Little A, please turn the air conditioner up".
Of course, there may be other ways of performing the deduplication and splicing operations, which this application does not enumerate one by one.
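The deduplicate-and-splice step against the historical deduplicated text can be sketched with a greedy character-level merge. The patent does not fix the matching granularity, so the algorithm below is one possible realisation, chosen because it reproduces the example in this section (history "小A请打开空调" plus new utterance "小A空调开高点" yields "小A请打开空调高点").

```python
def merge_dedup(history: str, new_text: str) -> str:
    # Scan new_text, drop each longest character run already present
    # in history, and append whatever remains onto history.
    # Illustrative assumption, not the patent's specified algorithm.
    i, kept = 0, []
    while i < len(new_text):
        j = i
        while j < len(new_text) and new_text[i:j + 1] in history:
            j += 1
        if j > i:
            i = j                     # duplicate run: drop it
        else:
            kept.append(new_text[i])  # novel character: keep it
            i += 1
    return history + "".join(kept)

print(merge_dedup("小A请打开空调", "小A空调开高点"))   # 小A请打开空调高点
```

A new utterance entirely covered by the history, such as a repeated "你好小A", leaves the history unchanged.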
After the second deduplicated text is obtained, the speech signals corresponding to the second deduplicated text in the 1st second voice signal and in the 2nd second voice signal can be determined, so as to obtain the speech signal resulting from the deduplication and splicing operations on the 1st and 2nd second voice signals (that is, the second deduplicated speech signal).
Take the 1st second voice signal being "Little A, Little A, please turn on the air conditioner" and the 2nd second voice signal being "Little A, Little A, turn the air conditioner up" as an example. Assume that the correspondence between "Little A, Little A, please turn on the air conditioner" and the speech signal segments in the 1st second voice signal is as shown in Table 1, and that the correspondence between "Little A, Little A, turn the air conditioner up" and the speech signal segments in the 2nd second voice signal is as shown in Table 3.
Table 3

Text                       | Speech signal segment
"Little A" (first)         | Speech signal segment 21
"Little A" (second)        | Speech signal segment 22
"turn the air conditioner" | Speech signal segment 23
"up"                       | Speech signal segment 24
After the second deduplicated text is determined to be "Little A, please turn the air conditioner up", the second deduplicated speech signal can be determined from the speech signal segments that this text corresponds to in the 1st second voice signal and in the 2nd second voice signal. For example, speech signal segment 11 corresponding to "Little A" (the first), speech signal segment 13 corresponding to "please turn on the air conditioner", and speech signal segment 24 corresponding to "up" are spliced into the second deduplicated speech signal. Of course, the second deduplicated speech signal may also be formed by splicing speech signal segment 12, speech signal segment 13, and speech signal segment 24.
After the second deduplicated speech signal is obtained, it can be determined whether it meets the update condition (that is, the judgment of step 803 is performed again); if so, step 807 is performed.
If it is determined that the second deduplicated speech signal does not meet the update condition, the 3rd second voice signal is further acquired and ASR processing is performed on it. After the text corresponding to the 3rd second voice signal is obtained, a deduplication operation is performed on that text according to the historical deduplicated text (the second deduplicated text), yielding the deduplicated text jointly corresponding to the 1st through 3rd second voice signals (which may be called the third deduplicated text). According to the third deduplicated text, a deduplicated speech signal (which may be called the third deduplicated speech signal) is determined from the 1st through 3rd second voice signals, and it is determined whether the third deduplicated speech signal meets the update condition.
Repeating the above operations, step 806 may be performed: acquire the i-th second voice signal, and determine, according to the i-th second voice signal and its historical deduplicated text, the deduplicated speech signal jointly corresponding to the 1st through i-th second voice signals, where the historical deduplicated text is determined from the texts respectively corresponding to the 1st through (i-1)-th second voice signals. This loop continues until the obtained deduplicated speech signal meets the update condition.
To better explain this embodiment of the application, an explanation is given with reference to Table 4. Assume the update condition is that the duration of the deduplicated speech signal is greater than 5 s (that is, the duration threshold is 5 s) and that the number of characters in the non-repeated text corresponding to the deduplicated speech signal is greater than 12 (that is, the character-count threshold is 12). During the voiceprint update process:
(1) The 1st second voice signal "你好小A" ("Hello Little A") is acquired, and ASR yields the corresponding text "你好小A". Since this text contains no repeated text, the first deduplicated text is "你好小A", with 4 characters, and the duration of the first deduplicated speech signal is 1.5 s. The character count (4) is not greater than 12 and the duration (1.5 s) is not greater than 5 s, so the 2nd second voice signal is further acquired.
(2) The 2nd second voice signal "打开空调" ("turn on the air conditioner") is acquired, and ASR yields the corresponding text "打开空调". Deduplication and splicing against the historical deduplicated text "你好小A" yield the second deduplicated text "你好小A打开空调", with 8 characters; the corresponding second deduplicated speech signal lasts 3 s. The character count (8) is not greater than 12 and the duration (3 s) is not greater than 5 s, so the 3rd second voice signal is further acquired.
(3) The 3rd second voice signal "空调开高点" ("turn the air conditioner up") is acquired, and ASR yields the corresponding text "空调开高点". Deduplication and splicing against the historical deduplicated text "你好小A打开空调" yield the third deduplicated text "你好小A打开空调高点", with 10 characters; the corresponding third deduplicated speech signal lasts 3.5 s. The character count (10) is not greater than 12 and the duration (3.5 s) is not greater than 5 s, so the 4th second voice signal is further acquired.
(4) The 4th second voice signal "你好小A" ("Hello Little A") is acquired, and ASR yields the corresponding text "你好小A". Deduplication and splicing against the historical deduplicated text "你好小A打开空调高点" yield the fourth deduplicated text "你好小A打开空调高点", still 10 characters; the corresponding fourth deduplicated speech signal lasts 3.5 s. The character count (10) is not greater than 12 and the duration (3.5 s) is not greater than 5 s, so the 5th second voice signal is further acquired.
(5) The 5th second voice signal "打开天窗" ("open the sunroof") is acquired, and ASR yields the corresponding text "打开天窗". Deduplication and splicing against the historical deduplicated text "你好小A打开空调高点" yield the fifth deduplicated text "你好小A打开空调高点天窗", with 12 characters; the corresponding fifth deduplicated speech signal lasts 4 s. The character count (12) is not greater than 12 and the duration (4 s) is not greater than 5 s, so the 6th second voice signal is further acquired.
(6) The 6th second voice signal "播放音乐" ("play music") is acquired, and ASR yields the corresponding text "播放音乐". Deduplication and splicing against the historical deduplicated text "你好小A打开空调高点天窗" yield the sixth deduplicated text "你好小A打开空调高点天窗播放音乐", with 16 characters; the corresponding sixth deduplicated speech signal lasts 5.5 s. Now the character count (16) is greater than 12 and the duration (5.5 s) is greater than 5 s.
Thus it is determined that the sixth deduplicated speech signal meets the update condition, and step 807 is performed based on the sixth deduplicated speech signal.
Table 4

No. | Second voice signal                     | Deduplicated text                | Characters | Duration
1   | 你好小A (Hello Little A)                | 你好小A                          | 4          | 1.5 s
2   | 打开空调 (turn on the air conditioner)  | 你好小A打开空调                  | 8          | 3 s
3   | 空调开高点 (turn the air conditioner up)| 你好小A打开空调高点              | 10         | 3.5 s
4   | 你好小A (Hello Little A)                | 你好小A打开空调高点              | 10         | 3.5 s
5   | 打开天窗 (open the sunroof)             | 你好小A打开空调高点天窗          | 12         | 4 s
6   | 播放音乐 (play music)                   | 你好小A打开空调高点天窗播放音乐  | 16         | 5.5 s
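The accumulation loop of steps 804 to 806 together with the Table 4 example can be simulated in a short sketch. The greedy character-level merge below is an illustrative assumption (the patent does not fix the deduplication algorithm), and the per-round durations are taken directly from Table 4 rather than measured from audio.

```python
def merge_dedup(history: str, new_text: str) -> str:
    # Greedy character-level deduplicate-and-splice: drop each longest
    # run of new_text already present in history, append the rest.
    # One possible realisation, not the patent's specified algorithm.
    i, kept = 0, []
    while i < len(new_text):
        j = i
        while j < len(new_text) and new_text[i:j + 1] in history:
            j += 1
        if j > i:
            i = j
        else:
            kept.append(new_text[i])
            i += 1
    return history + "".join(kept)

CHAR_THRESHOLD = 12    # character-count threshold from the example
TIME_THRESHOLD = 5.0   # duration threshold in seconds

utterances = ["你好小A", "打开空调", "空调开高点", "你好小A", "打开天窗", "播放音乐"]
durations  = [1.5, 3.0, 3.5, 3.5, 4.0, 5.5]   # cumulative durations from Table 4

history = ""
for n, (text, secs) in enumerate(zip(utterances, durations), start=1):
    history = merge_dedup(history, text)
    if len(history) > CHAR_THRESHOLD and secs > TIME_THRESHOLD:
        break

print(n, history, len(history))   # 6 你好小A打开空调高点天窗播放音乐 16
```

The loop stops at the 6th utterance with a 16-character deduplicated text and a 5.5 s signal, matching rows (1) through (6) of Table 4.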
Step 807: determine the third voiceprint of the first user according to the deduplicated speech signal that meets the update condition.
For example, the voiceprint extraction procedure illustrated in Fig. 1c is executed on the deduplicated speech signal that meets the update condition, to obtain the third voiceprint of the first user.
Step 808: update the reference voiceprint of the first user according to the third voiceprint.
In one optional implementation, the reference voiceprint of the first user may be updated actively, that is, the third voiceprint replaces the original reference voiceprint. In another optional implementation, the first user may be prompted as to whether to update the reference voiceprint.
The prompting manner can be explained with the example in Fig. 2: the first user is user A (the vehicle owner), and when the in-vehicle device determines that user A's voiceprint has changed permanently, it may prompt user A on the display interface as to whether the reference voiceprint should be updated automatically.
For example, as shown in Fig. 9, the in-vehicle device displays the prompt "Your voiceprint has changed. Replace the original voiceprint with the detected one?" on the display interface. If user A taps "OK", the in-vehicle device replaces the original reference voiceprint with the obtained third voiceprint. If user A taps "No, I will update it myself", the in-vehicle device may further display a passage of preset text on the display interface and prompt user A to read it aloud; the device then obtains user A's current voice signal, extracts a voiceprint from that signal to obtain user A's reference voiceprint, and stores it. Of course, in some other embodiments, when it is determined from the verification results of the first voiceprints in the first time window that the reference voiceprint needs to be updated, user A may also be prompted on the display interface to update the reference voiceprint manually.
After the reference voiceprint is updated, a sliding operation may be performed on the second time window to obtain a slid second time window, where the starting point of the slid second time window is no earlier than the time point at which the reference voiceprint was updated; alternatively, the starting point of the slid second time window is no earlier than the time point at which the second voice signal was acquired.
For example, the starting point of the second time window may be after the end point of the third time window, or coincide with it. As shown in Fig. 5(e), the starting point of the slid second time window coincides with the end point of the third time window. Further, within the second time window after the reference voiceprint is updated (that is, the slid second time window), one or more second voiceprints may be obtained, and the average similarity of the slid second time window is determined from the similarities between each of those second voiceprints and the updated reference voiceprint.
Further, after the average similarity of the second time window is determined, the first time window is slid, with the end point of the slid first time window later than the starting point of the slid second time window. For example, the starting point of the slid first time window may be after the end point of the slid second time window, or coincide with it. As shown in Fig. 5(e), the starting point of the slid first time window coincides with the end point of the slid second time window; specifically, the slid first time window is t8-t9, where t8 and t9 are separated by the first preset duration.
One or more first voiceprints are obtained in the slid first time window, and their verification results are determined from the similarities between the first voiceprints and the updated reference voiceprint together with the average similarity of the slid second time window; it is then determined whether to update the reference voiceprint again.
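The window logic just described can be sketched as follows. All similarity values and the ratio threshold below are hypothetical placeholders; the patent leaves the concrete values open.

```python
def average_similarity(similarities):
    # Average similarity of the (slid) second time window.
    return sum(similarities) / len(similarities)

def pass_rate(first_window_sims, baseline_avg):
    # Fraction of first-window voiceprints whose similarity to the
    # updated reference exceeds the second-window average similarity.
    passed = sum(1 for s in first_window_sims if s > baseline_avg)
    return passed / len(first_window_sims)

second_window_sims = [0.92, 0.90, 0.91]        # hypothetical values
baseline = average_similarity(second_window_sims)

first_window_sims = [0.70, 0.93, 0.72, 0.68]   # hypothetical values
rate = pass_rate(first_window_sims, baseline)

RATIO_THRESHOLD = 0.5                          # hypothetical threshold
needs_update = rate <= RATIO_THRESHOLD
print(round(rate, 2), needs_update)            # 0.25 True
```

A low pass rate in the first window (here 0.25, below the threshold) indicates the voiceprint may have changed permanently, triggering the update procedure of steps 804 to 808.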
An explanation is given with reference to the schematic flowchart of yet another voiceprint management method illustrated in Fig. 10:
The voice signal of the first user is acquired and pre-processed to obtain the audio feature information of the voice signal. Audio feature extraction is performed on the audio feature information to determine the first user's voiceprint, and post-processing is performed to obtain the final voiceprint of the first user. In one case, this voiceprint of the first user is registered or updated as the reference voiceprint. In another case, it may be determined that the current time point lies in the first time window or in the second time window.
When the current time point lies in the first time window, the first user's voiceprint (that is, the first voiceprint) is compared with the reference voiceprint to obtain the similarity between the first voiceprint and the reference voiceprint; if this similarity is greater than the average similarity of the second time window, the first voiceprint is determined to have passed verification. The ratio of first voiceprints in the first time window that pass verification (that is, the pass rate in the first time window) is determined; if the pass rate is greater than the ratio threshold, the first time window is slid; if the pass rate is less than or equal to the ratio threshold, the procedure for updating the reference voiceprint is started.
When the current time point lies in the second time window, the first user's voiceprint (that is, the second voiceprint) is compared with the reference voiceprint to obtain the similarity between the second voiceprint and the reference voiceprint. The average of the similarities between the multiple second voiceprints in the second time window and the reference voiceprint is determined as the average similarity.
It should be added that multiple thresholds are involved in this application, such as the ratio threshold, similarity threshold, difference threshold, count threshold, duration threshold, and character-count threshold. In a threshold-based judgment, values greater than or equal to the threshold may correspond to the first result and values less than the threshold to the second result; alternatively, values greater than the threshold may correspond to the first result and values less than or equal to the threshold to the second result. This application does not limit this. For example, in determining whether to update the first user's reference voiceprint, although the embodiment states that the reference voiceprint need not be updated when the ratio is greater than the ratio threshold (the first result) and is updated when the ratio is less than or equal to the ratio threshold (the second result), it is equally possible in this application that the reference voiceprint need not be updated when the ratio is greater than or equal to the ratio threshold (the first result) and is updated when the ratio is less than the ratio threshold (the second result).
It should also be added that, in the process of the voiceprint management device obtaining the text corresponding to the second voice signal, a semantic understanding device may perform the ASR to obtain the text corresponding to the second voice signal, and the voiceprint management device then obtains that text from the semantic understanding device; this application does not limit this.
应理解,上述技术方案中,预先确定第一用户的基准声纹,然后在第二时间窗中获取第一用户的第二声纹,确定第二声纹与基准声纹的相似度,根据第二时间窗中的多个相似度,确定第二时间窗的平均相似度。由于第二时间窗在存储基准声纹的时间点之后,即用户在注册基准声纹的时间点至第二时间窗的终止点的时间段中,用户的声纹是处于平稳变化的,可以基于该处于平稳变化的第一用户的声纹,得到基准参数(即基准声纹和第二时间窗的平均相似度)。在后续判定用户声纹是否发生长久改变的过程中,可以基于第一时间窗中用户的第一声纹,确定第一声纹与基准声纹之间的相似度,将该相似度与第二时间窗的平均相似度相比较,确定第一声纹的验证结果,即确定第一时间窗中第一声纹是否与第二时间窗中第二声纹类似,进而确定第二时间窗的起始点至第一时间窗的终止点的时间段中,用户的声纹是否发生了长久改变。It should be understood that in the above technical solution, the reference voiceprint of the first user is determined in advance, and then the second voiceprint of the first user is obtained in the second time window, and the similarity between the second voiceprint and the reference voiceprint is determined. Multiple similarities in two time windows, determining an average similarity in the second time window. Since the second time window is after the time point when the reference voiceprint is stored, that is, during the time period from the time point when the user registers the reference voiceprint to the end point of the second time window, the user's voiceprint is in a steady change, which can be based on The voiceprint of the first user that is changing steadily is used to obtain a reference parameter (that is, the average similarity between the reference voiceprint and the second time window). 
In the subsequent process of determining whether the user's voiceprint has undergone permanent changes, the similarity between the first voiceprint and the reference voiceprint can be determined based on the user's first voiceprint in the first time window, and the similarity can be compared with the second Compare the average similarity of the time window to determine the verification result of the first voiceprint, that is, determine whether the first voiceprint in the first time window is similar to the second voiceprint in the second time window, and then determine the start of the second time window During the time period from the start point to the end point of the first time window, whether the user's voiceprint has undergone a long-term change.
而且,本申请中确定第一时间窗中多个第一声纹的验证结果,确定第一时间窗中通过验证的第一声纹的比率(即第一时间窗中达标率),根据第一时间窗中达标率确定用户的声纹是否发生了长久改变,避免因为第一时间窗中用户偶然原因导致误判断,有助于提高判定的准确性。Moreover, in this application, the verification results of multiple first voiceprints in the first time window are determined, and the ratio of the first voiceprints that pass verification in the first time window (that is, the compliance rate in the first time window) is determined. According to the first The compliance rate in the time window determines whether the user's voiceprint has changed for a long time, avoiding misjudgment caused by the user's accidental reasons in the first time window, and helping to improve the accuracy of the judgment.
Further, when the reference voiceprint is not updated, the first time window can be slid, the ratio of first voiceprints that pass verification in the slid first time window (that is, the pass rate in the slid first time window) can be determined, and whether the user's voiceprint has changed permanently can be determined according to that pass rate. This helps detect a permanent change in the first user's voiceprint in a timely manner and update the first user's registered voiceprint.
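Sliding the first time window can be sketched with a fixed-size buffer of recent verification results; the window size of five results and the 0.5 threshold are illustrative assumptions (the disclosure also allows a variable-length window).

```python
from collections import deque

class SlidingFirstTimeWindow:
    # Minimal sketch: keep the most recent verification results and
    # re-evaluate the pass rate each time the window slides forward.

    def __init__(self, size, ratio_threshold=0.5):
        self.results = deque(maxlen=size)
        self.ratio_threshold = ratio_threshold

    def slide(self, new_result):
        # The end point of the window moves later as each new result
        # arrives; the oldest result falls out once the window is full.
        self.results.append(new_result)

    def needs_update(self):
        # True when the pass rate in the current window is at or below
        # the threshold.
        if len(self.results) < self.results.maxlen:
            return False  # wait until the window is full
        return sum(self.results) / len(self.results) <= self.ratio_threshold

window = SlidingFirstTimeWindow(size=5)
for result in [True, True, True, True, True]:
    window.slide(result)
print(window.needs_update())  # False: every recent voiceprint passed

for result in [False, False, False, False]:
    window.slide(result)  # window now holds one True and four False results
print(window.needs_update())  # True: pass rate 1/5 is below the threshold
```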
Further, when the reference voiceprint is updated, the second voice signal is acquired, and the voiceprint of the first user is determined according to the acquired second voice signal, without interacting with the user. The update is therefore imperceptible to the user, which helps improve the user experience. Moreover, there is no need to pre-store the multiple first voiceprints in the first time window, which helps reduce the amount of storage. In the present application, there is also no need to store the multiple second voiceprints in the second time window; only the average similarity of the second time window needs to be saved, which further reduces the amount of storage.
The various embodiments described herein may be independent solutions, or may be combined according to their internal logic, and all of these solutions fall within the protection scope of the present application.
It can be understood that, in the foregoing method embodiments, the methods and operations implemented by the voiceprint management apparatus may also be implemented by a component (for example, a chip or a circuit) usable in the voiceprint management apparatus.
The division of modules in the embodiments of the present application is schematic and is merely a logical function division; other division manners are possible in actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated into one processor, may each exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
Based on the foregoing content and the same concept, FIG. 11 and FIG. 12 are schematic structural diagrams of possible voiceprint management apparatuses exemplarily provided by the present application. These voiceprint management apparatuses can be used to implement the functions of the foregoing method embodiments, and can therefore also achieve the beneficial effects of those method embodiments.
As shown in FIG. 11, the voiceprint management apparatus provided by the present application includes an acquisition module 1101 and a processing module 1102.
Exemplarily, the acquisition module 1101 may be configured to perform the acquisition functions of the voiceprint management apparatus in the method embodiments related to FIG. 6, FIG. 8, or FIG. 10, for example, the step of acquiring the first voice signal or the step of acquiring the second voice signal. Exemplarily, the processing module 1102 may be configured to perform the processing functions of the voiceprint management apparatus in those method embodiments, for example, the step of determining the verification result of the first voiceprint, the step of determining whether to update the reference voiceprint, or the step of updating the reference voiceprint.
In a possible implementation, the acquisition module 1101 is configured to acquire the verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and the average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate how similar voiceprints of the first user acquired in the second time window are to the reference voiceprint; and the start point of the first time window is later than the start point of the second time window. The processing module 1102 is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
In a possible implementation, the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window as follows: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification is passed.
In a possible implementation, the processing module 1102 is specifically configured to update the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window.
In a possible implementation, the processing module 1102 is specifically configured to update the reference voiceprint when the ratio of first voiceprints that pass verification in the first time window is less than or equal to a ratio threshold.
In a possible implementation, the processing module 1102 is specifically configured to: control the acquisition module 1101 to acquire a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate how similar voiceprints of the first user acquired in the first time window are to the reference voiceprint; and update the reference voiceprint according to the second voice signal.
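The selection of a usable second voice signal can be sketched as a threshold test; the numeric similarity values and the 0.1 difference threshold below are assumptions for illustration, not values from the disclosure.

```python
def is_usable_update_signal(signal_similarity, first_window_avg, diff_threshold=0.1):
    # The second voice signal used for updating must resemble the user's
    # current (drifted) voice: its similarity to the reference voiceprint
    # must lie within diff_threshold of the first time window's average
    # similarity.
    return abs(signal_similarity - first_window_avg) < diff_threshold

first_window_avg = 0.6  # assumed average similarity of the first time window

print(is_usable_update_signal(0.55, first_window_avg))  # True: matches the drifted voice
print(is_usable_update_signal(0.95, first_window_avg))  # False: an outlier relative to the window
```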
In a possible implementation, the processing module 1102 is specifically configured to: obtain a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive text; and update the reference voiceprint according to the deduplicated voice signal.
In a possible implementation, there is one second voice signal, and the processing module 1102 is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and obtain the deduplicated voice signal according to the voice signal corresponding to the deduplicated text within the second voice signal.
In a possible implementation, there are multiple second voice signals, and the processing module 1102 is specifically configured to: for the i-th second voice signal among the multiple second voice signals, perform a deduplication operation on the text corresponding to the i-th second voice signal according to historical deduplicated text, and splice the text corresponding to the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the 1st to i-th second voice signals, where the historical deduplicated text is obtained from the texts respectively corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtain the deduplicated voice signal according to the voice signals, among the multiple second voice signals, that correspond to the deduplicated text jointly corresponding to the multiple second voice signals.
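The incremental deduplication over multiple second voice signals can be sketched at character level (a natural unit for Chinese text); the transcripts below are invented examples, and the step of mapping the retained characters back to their audio segments is elided.

```python
def deduplicate_incrementally(transcripts):
    # For the i-th transcript, drop characters already present in the
    # historical deduplicated text, then splice the remainder onto that
    # history, yielding the deduplicated text jointly corresponding to
    # transcripts 1..i.
    history = ""   # historical deduplicated text
    seen = set()   # characters already covered by the history
    for text in transcripts:
        for ch in text:
            if ch not in seen:
                seen.add(ch)
                history += ch
    return history

transcripts = ["open the map", "play the song", "navigate home"]
dedup = deduplicate_incrementally(transcripts)
print(dedup)       # each character appears exactly once
print(len(dedup))  # equals the number of distinct characters overall
```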
In a possible implementation, the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.
In a possible implementation, the processing module 1102 is further configured to slide the first time window, where the end point of the slid first time window is later than the end point of the first time window before sliding.
In a possible implementation, the length of the first time window is variable.
In a possible implementation, the processing module 1102 is further configured to slide the second time window, where the end point of the slid second time window is later than the end point of the second time window before sliding, and the start point of the slid second time window is earlier than the start point of the slid first time window.
In a possible implementation, before the acquisition module 1101 acquires the verification result of the first voiceprint in the first time window, the processing module 1102 is further configured to: determine the similarities in the second time window according to the reference voiceprint and the second voiceprints of the first user acquired in the second time window; and determine the average similarity of the second time window according to the multiple similarities in the second time window.
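The computation of the second time window's average similarity can be sketched as the mean of per-voiceprint similarities against the reference; cosine similarity and the toy embeddings below are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    # Similarity between two voiceprint embeddings (assumed measure).
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def second_window_average(reference_vp, second_voiceprints):
    # Mean of the similarities between each second voiceprint acquired
    # in the second time window and the reference voiceprint.
    sims = [cosine_similarity(vp, reference_vp) for vp in second_voiceprints]
    return sum(sims) / len(sims)

reference = [1.0, 0.0]
window_voiceprints = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]]
avg = second_window_average(reference, window_voiceprints)
print(round(avg, 3))  # close to 1.0, since the window tracks the reference
```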
FIG. 12 shows an apparatus provided by an embodiment of the present application. The apparatus shown in FIG. 12 may be a hardware-circuit implementation of the apparatus shown in FIG. 11. This apparatus can be applied in the flowcharts shown above to perform the foregoing method embodiments.
For ease of illustration, FIG. 12 shows only the main components of the apparatus.
The voiceprint management apparatus includes a processor 1210 and an interface 1230; optionally, the voiceprint management apparatus further includes a memory 1220. The interface 1230 is used for communicating with other devices.
The methods performed by the voiceprint management apparatus in the foregoing embodiments may be implemented by the processor 1210 invoking a program stored in a memory (which may be the memory 1220 in the voiceprint management apparatus or an external memory). That is, the voiceprint management apparatus may include the processor 1210, and the processor 1210 invokes a program in the memory to perform the methods performed by the voiceprint management apparatus in the foregoing method embodiments. The processor here may be an integrated circuit with signal processing capability, for example, a CPU. The voiceprint management apparatus may be implemented by one or more integrated circuits configured to implement the foregoing methods, for example, one or more ASICs, one or more microprocessor DSPs, one or more FPGAs, or a combination of at least two of these integrated circuit forms. Alternatively, the foregoing implementations may be combined.
Specifically, the functions/implementation processes of the processing module 1102 and the acquisition module 1101 in FIG. 11 may be implemented by the processor 1210 in the voiceprint management apparatus shown in FIG. 12 invoking computer-executable instructions stored in the memory 1220.
Based on the foregoing content and the same concept, the present application provides a computer program product. The computer program product includes a computer program or instructions, and when the computer program or instructions are executed by a computing device, the methods in the foregoing method embodiments are implemented.
Based on the foregoing content and the same concept, the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program or instructions, and when the computer program or instructions are executed by a computing device, the methods in the foregoing method embodiments are implemented.
Based on the foregoing content and the same concept, the present application provides a computing device, including a processor. The processor is connected to a memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device performs the methods in the foregoing method embodiments.
Based on the foregoing content and the same concept, an embodiment of the present application provides a chip system, including a processor and a memory. The processor is coupled to the memory, and the memory is configured to store a program or instructions. When the program or instructions are executed by the processor, the chip system implements the methods in the foregoing method embodiments.
Optionally, the chip system further includes an interface circuit, and the interface circuit is used to exchange code instructions with the processor.
Optionally, there may be one or more processors in the chip system, and the processor may be implemented by hardware or by software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented by software, the processor may be a general-purpose processor that operates by reading software code stored in a memory.
Optionally, there may also be one or more memories in the chip system. The memory may be integrated with the processor or disposed separately from the processor. Exemplarily, the memory may be a non-transitory memory, for example, a read-only memory (ROM); it may be integrated with the processor on the same chip or disposed on a different chip.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Apparently, those skilled in the art can make various changes and modifications to the present application without departing from the scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include these modifications and variations.

Claims (29)

  1. A voiceprint management method, comprising:
    acquiring a verification result of a first voiceprint in a first time window, wherein the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate how similar voiceprints of the first user acquired in the second time window are to the reference voiceprint; and a start point of the first time window is later than a start point of the second time window; and
    updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  2. The method according to claim 1, wherein the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window comprises:
    when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification is passed.
  3. The method according to claim 1 or 2, wherein the updating the reference voiceprint according to the verification result of the first voiceprint in the first time window comprises:
    updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window.
  4. The method according to claim 3, wherein the updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window comprises:
    updating the reference voiceprint when the ratio of first voiceprints that pass verification in the first time window is less than or equal to a ratio threshold.
  5. The method according to any one of claims 1 to 4, wherein the updating the reference voiceprint comprises:
    acquiring a second voice signal of the first user, wherein the difference between the similarity of the second voice signal to the reference voiceprint and an average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate how similar voiceprints of the first user acquired in the first time window are to the reference voiceprint; and
    updating the reference voiceprint according to the second voice signal.
  6. The method according to claim 5, wherein the updating the reference voiceprint according to the second voice signal comprises:
    obtaining a deduplicated voice signal according to the second voice signal and text corresponding to the second voice signal, wherein text corresponding to the deduplicated voice signal is non-repetitive text; and
    updating the reference voiceprint according to the deduplicated voice signal.
  7. The method according to claim 6, wherein there is one second voice signal, and the obtaining a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal comprises:
    performing a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and
    obtaining the deduplicated voice signal according to the voice signal corresponding to the deduplicated text within the second voice signal.
  8. The method according to claim 6, wherein there are multiple second voice signals, and the obtaining a deduplicated voice signal according to the second voice signals and the texts corresponding to the second voice signals comprises:
    for an i-th second voice signal among the multiple second voice signals:
    performing a deduplication operation on the text corresponding to the i-th second voice signal according to historical deduplicated text, and splicing the text corresponding to the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain deduplicated text jointly corresponding to the 1st to i-th second voice signals, wherein the historical deduplicated text is obtained from the texts respectively corresponding to the 1st to (i-1)-th second voice signals, and i is greater than 1; and
    obtaining the deduplicated voice signal according to the voice signals, among the multiple second voice signals, that correspond to the deduplicated text jointly corresponding to the multiple second voice signals.
  9. The method according to any one of claims 6 to 8, wherein the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.
  10. The method according to any one of claims 1 to 9, further comprising:
    sliding the first time window, wherein the end point of the slid first time window is later than the end point of the first time window before sliding.
  11. The method according to any one of claims 1 to 10, wherein the length of the first time window is variable.
  12. The method according to any one of claims 1 to 11, further comprising:
    sliding the second time window, wherein the end point of the slid second time window is later than the end point of the second time window before sliding, and the start point of the slid second time window is earlier than the start point of the slid first time window.
  13. The method according to any one of claims 1 to 12, further comprising, before the acquiring a verification result of a first voiceprint in a first time window:
    determining the similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and
    determining the average similarity of the second time window according to the multiple similarities in the second time window.
  14. A voiceprint management apparatus, comprising:
    an acquisition module and a processing module;
    wherein the acquisition module is configured to acquire a verification result of a first voiceprint in a first time window, wherein the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate how similar voiceprints of the first user acquired in the second time window are to the reference voiceprint; and a start point of the first time window is later than a start point of the second time window; and
    the processing module is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  15. The apparatus according to claim 14, wherein the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window comprises:
    when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification is passed.
  16. The apparatus according to claim 14 or 15, wherein the processing module is specifically configured to:
    update the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window.
  17. The apparatus according to claim 16, wherein the processing module is specifically configured to:
    update the reference voiceprint when the ratio of first voiceprints that pass verification in the first time window is less than or equal to a ratio threshold.
  18. The apparatus according to any one of claims 14 to 17, wherein the processing module is specifically configured to:
    control the acquisition module to acquire a second voice signal of the first user, wherein the difference between the similarity of the second voice signal to the reference voiceprint and an average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate how similar voiceprints of the first user acquired in the first time window are to the reference voiceprint; and
    update the reference voiceprint according to the second voice signal.
  19. The apparatus according to claim 18, wherein the processing module is specifically configured to:
    obtain a deduplicated voice signal according to the second voice signal and text corresponding to the second voice signal, wherein text corresponding to the deduplicated voice signal is non-repetitive text; and
    update the reference voiceprint according to the deduplicated voice signal.
  20. The apparatus according to claim 18, wherein there is one second voice signal, and the processing module is specifically configured to:
    perform a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and
    obtain the deduplicated voice signal according to the voice signal corresponding to the deduplicated text within the second voice signal.
  21. The apparatus according to claim 18, wherein there are multiple second speech signals, and the processing module is specifically configured to:
    for the i-th second speech signal among the multiple second speech signals:
    perform a deduplication operation on the text corresponding to the i-th second speech signal according to historical deduplicated text; splice the text corresponding to the i-th second speech signal after the deduplication operation with the historical deduplicated text, to obtain the deduplicated text jointly corresponding to the first second speech signal through the i-th second speech signal, wherein the historical deduplicated text is obtained according to the texts respectively corresponding to the first second speech signal through the (i-1)-th second speech signal, and i is greater than 1; and
    obtain the deduplicated speech signal according to the speech signals corresponding, in the multiple second speech signals, to the deduplicated text jointly corresponding to the multiple second speech signals.
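The incremental deduplication described in claims 19 to 21 can be sketched as follows. This is an illustrative simplification, not the claimed implementation: the function and variable names are hypothetical, and text is deduplicated here at character granularity, with `seen` playing the role of the historical deduplicated text.

```python
def incremental_dedup(texts):
    """Deduplicate the texts of multiple second speech signals in order.

    Units already present in the historical deduplicated text (`seen`)
    are dropped from the i-th text; the remainder is spliced onto the
    running result, yielding the non-repetitive text jointly
    corresponding to signals 1..i.
    """
    seen = set()      # historical deduplicated text, as a set of units
    dedup_text = []   # deduplicated text common to signals 1..i
    for text in texts:
        for unit in text:
            if unit not in seen:   # deduplicate against history
                seen.add(unit)
                dedup_text.append(unit)   # splice onto history
    return "".join(dedup_text)
```

The deduplicated speech signal would then be assembled from the audio segments aligned with the surviving text units.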
  22. The apparatus according to any one of claims 18 to 21, wherein the duration of the deduplicated speech signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated speech signal is greater than a character-count threshold.
  23. The apparatus according to any one of claims 14 to 22, wherein the processing module is further configured to:
    slide the first time window, wherein the end point of the first time window after sliding is later than the end point of the first time window before sliding.
  24. The apparatus according to any one of claims 14 to 23, wherein the length of the first time window is variable.
  25. The apparatus according to any one of claims 14 to 24, wherein the processing module is further configured to:
    slide the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
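The window-sliding behaviour of claims 23 and 25 can be illustrated with a small sketch. Representing a time window as a `(start, end)` pair and sliding both windows by a fixed step are assumptions made for illustration only; the claims do not fix a representation or a step size.

```python
def slide_window(window, step):
    """Slide a (start, end) time window forward by `step` seconds."""
    start, end = window
    return (start + step, end + step)

# Hypothetical first and second time windows; the second starts
# earlier than the first (as in claim 25) and may be longer.
first = (10.0, 20.0)
second = (0.0, 20.0)

first_slid = slide_window(first, 5.0)
second_slid = slide_window(second, 5.0)

# End points move later after sliding (claims 23 and 25), and the
# slid second window still starts before the slid first window.
assert first_slid[1] > first[1]
assert second_slid[1] > second[1]
assert second_slid[0] < first_slid[0]
```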
  26. The apparatus according to any one of claims 14 to 25, wherein before the acquiring module acquires the verification result of the first voiceprint in the first time window, the processing module is further configured to:
    determine the similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and
    determine the average similarity of the second time window according to the multiple similarities in the second time window.
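Claim 26's per-window average similarity could be computed, for example, as the mean similarity between the reference voiceprint and each second voiceprint acquired in the second time window. Cosine similarity over fixed-length voiceprint embeddings is an assumption for this sketch; the claims do not prescribe a particular similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def window_average_similarity(reference, window_voiceprints):
    """Average the similarity of the reference voiceprint to every
    second voiceprint acquired in the second time window."""
    sims = [cosine_similarity(reference, vp) for vp in window_voiceprints]
    return sum(sims) / len(sims)
```

The resulting average could then serve as the baseline against which the first time window's verification result is judged.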
  27. A computer program product, comprising a computer program or instructions which, when executed by a computing device, implement the method according to any one of claims 1 to 13.
  28. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program or instructions which, when executed by a computing device, implement the method according to any one of claims 1 to 13.
  29. A computing device, comprising a processor connected to a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device performs the method according to any one of claims 1 to 13.
PCT/CN2021/093917 2021-05-14 2021-05-14 Voiceprint management method and apparatus WO2022236827A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180041437.8A CN115699168A (en) 2021-05-14 2021-05-14 Voiceprint management method and device
PCT/CN2021/093917 WO2022236827A1 (en) 2021-05-14 2021-05-14 Voiceprint management method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/093917 WO2022236827A1 (en) 2021-05-14 2021-05-14 Voiceprint management method and apparatus

Publications (1)

Publication Number Publication Date
WO2022236827A1 2022-11-17

Family

ID=84027940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093917 WO2022236827A1 (en) 2021-05-14 2021-05-14 Voiceprint management method and apparatus

Country Status (2)

Country Link
CN (1) CN115699168A (en)
WO (1) WO2022236827A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597839A (en) * 2023-07-17 2023-08-15 山东唐和智能科技有限公司 Intelligent voice interaction system and method
CN116597839B (en) * 2023-07-17 2023-09-19 山东唐和智能科技有限公司 Intelligent voice interaction system and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070219801A1 (en) * 2006-03-14 2007-09-20 Prabha Sundaram System, method and computer program product for updating a biometric model based on changes in a biometric feature of a user
CN108231082A (en) * 2017-12-29 2018-06-29 广州势必可赢网络科技有限公司 Updating method and device for self-learning voiceprint recognition
CN110400567A (en) * 2019-07-30 2019-11-01 深圳秋田微电子股份有限公司 Register vocal print dynamic updating method and computer storage medium
CN110896352A (en) * 2018-09-12 2020-03-20 阿里巴巴集团控股有限公司 Identity recognition method, device and system
CN111063360A (en) * 2020-01-21 2020-04-24 北京爱数智慧科技有限公司 Voiceprint library generation method and device
CN112435673A (en) * 2020-12-15 2021-03-02 北京声智科技有限公司 Model training method and electronic terminal


Also Published As

Publication number Publication date
CN115699168A (en) 2023-02-03

Similar Documents

Publication Publication Date Title
WO2017197953A1 (en) Voiceprint-based identity recognition method and device
WO2018149209A1 (en) Voice recognition method, electronic device, and computer storage medium
US10204619B2 (en) Speech recognition using associative mapping
CN110265037B (en) Identity verification method and device, electronic equipment and computer readable storage medium
CN117577099A (en) Method, system and medium for multi-user authentication on a device
JP6469252B2 (en) Account addition method, terminal, server, and computer storage medium
WO2020155584A1 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2017162053A1 (en) Identity authentication method and device
WO2016014321A1 (en) Real-time emotion recognition from audio signals
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
CN111243603B (en) Voiceprint recognition method, system, mobile terminal and storage medium
US10923101B2 (en) Pausing synthesized speech output from a voice-controlled device
US11443730B2 (en) Initiating synthesized speech output from a voice-controlled device
KR101888058B1 (en) The method and apparatus for identifying speaker based on spoken word
WO2021213490A1 (en) Identity verification method and apparatus and electronic device
TW202018696A (en) Voice recognition method and device and computing device
CN116508097A (en) Speaker recognition accuracy
CN110634492A (en) Login verification method and device, electronic equipment and computer readable storage medium
WO2022236827A1 (en) Voiceprint management method and apparatus
CN109065026B (en) Recording control method and device
US10657951B2 (en) Controlling synthesized speech output from a voice-controlled device
US11763806B1 (en) Speaker recognition adaptation
CN110970027B (en) Voice recognition method, device, computer storage medium and system
CN108564374A (en) Payment authentication method, device, equipment and storage medium
US20230113883A1 (en) Digital Signal Processor-Based Continued Conversation

Legal Events

Date Code Title Description
121 (EP): the EPO has been informed by WIPO that EP was designated in this application; Ref document number: 21941396; Country of ref document: EP; Kind code of ref document: A1
NENP: non-entry into the national phase; Ref country code: DE
122 (EP): PCT application non-entry in European phase; Ref document number: 21941396; Country of ref document: EP; Kind code of ref document: A1