WO2022236827A1 - Voiceprint management method and apparatus - Google Patents

Voiceprint management method and apparatus

Info

Publication number
WO2022236827A1
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
time window
voice signal
text
user
Prior art date
Application number
PCT/CN2021/093917
Other languages
French (fr)
Chinese (zh)
Inventor
张嘉祺 (Zhang Jiaqi)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202180041437.8A (published as CN115699168A)
Priority to PCT/CN2021/093917 (published as WO2022236827A1)
Publication of WO2022236827A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques

Definitions

  • the present application relates to the technical field of speech recognition, in particular to a voiceprint management method and device.
  • voiceprint recognition technology has been widely used in many scenarios, such as vehicle network scenarios, smart home scenarios, and business processing scenarios.
  • voiceprint recognition technology compares a voiceprint stored in the voiceprint management device with a collected voiceprint to determine the identity information of the user.
  • the user's reference voiceprint (that is, the voiceprint registered by the user in the voiceprint management device) is pre-stored in the voiceprint management device.
  • in voiceprint recognition, when the voiceprint management device acquires the user's collected voiceprint, it can compare the collected voiceprint with the reference voiceprint to determine whether the two correspond to the same user.
  • the user's voiceprint may change with time.
  • the present application provides a voiceprint management method and device, which are used to timely and accurately update a reference voiceprint in a voiceprint management device, thereby improving recognition accuracy in voiceprint recognition.
  • the present application provides a voiceprint management method, which can be executed by a terminal device, such as a vehicle-mounted device (for example, an in-vehicle head unit or a vehicle-mounted computer) or a user terminal (for example, a mobile phone or a computer).
  • the voiceprint management method can also be implemented by components of the terminal device, such as processing devices, circuits, or chips in the terminal device, for example, a system chip or a processing chip.
  • the system chip is also called a system on chip (SoC) chip.
  • the method may also be executed by a server, and the server may include physical devices such as hosts or processors, or virtual devices such as virtual machines or containers, and may also include chips or integrated circuits.
  • the server is, for example, an Internet of Vehicles (IoV) server, also known as a cloud server, the cloud, or a cloud controller.
  • the IoV server may be a single server or a server cluster composed of multiple servers; this is not specifically limited.
  • the method may also be executed by components of the server, for example, implemented by components such as processing devices, circuits, and chips in the server.
  • the voiceprint management method provided by the present application includes: acquiring the verification result of the first voiceprint in the first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window.
  • the first voiceprint is determined according to the first voice signal of the first user acquired in the first time window.
  • the average similarity of the second time window is used to indicate the similarity between the voiceprints of the first user acquired in the second time window and the reference voiceprint, and the starting point of the first time window is later than the starting point of the second time window; the reference voiceprint is updated according to the verification result of the first voiceprint in the first time window.
  • the average similarity of the second time window is predetermined, and it is used together with the reference voiceprint as a reference parameter for deciding whether to update the reference voiceprint; this avoids inaccurate judgments caused by an inaccurate single registration voiceprint recorded when the first user registered. Further, the similarities between the multiple first voiceprints in the first time window and the reference voiceprint are determined and combined with the average similarity of the second time window to determine whether the first user's voiceprint has undergone a permanent change. When it has, the first user's reference voiceprint is updated; when it has not (for example, only a short-term change occurred), the reference voiceprint is not updated. This helps improve the accuracy of the first user's reference voiceprint, and thus the recognition accuracy and robustness of voiceprint recognition.
  • the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window as follows: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the first voiceprint passes verification; and/or, when the similarity between the first voiceprint and the reference voiceprint is less than or equal to the average similarity of the second time window, the first voiceprint fails verification.
  • using the reference voiceprint together with the average similarity of the second time window as the reference parameter for judging whether the first voiceprint passes verification avoids the problem of an inaccurate registered voiceprint caused by accidental factors when the first user registered, and helps improve the accuracy of the verification result of the first voiceprint.
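The pass/fail rule described above can be sketched as follows. This is an illustrative Python sketch only: the function names, the choice of cosine similarity as the similarity measure, and the example vectors are assumptions for illustration, not details taken from the patent.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two voiceprint feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify_voiceprint(first_vp, reference_vp, window_avg_similarity):
    # The first voiceprint passes verification only when its similarity
    # to the reference voiceprint exceeds the average similarity of the
    # second time window.
    return cosine_similarity(first_vp, reference_vp) > window_avg_similarity
```

Using the second-window average as the threshold, rather than a fixed constant, ties the decision to how similar the user's own recent voiceprints normally are to the reference.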
  • updating the reference voiceprint according to the verification result of the first voiceprint in the first time window includes: updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window (that is, the compliance rate of the first time window), where this ratio is the number of first voiceprints that pass verification in the first time window divided by the total number of first voiceprints in the first time window.
  • updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window includes: updating the reference voiceprint when that ratio is less than or equal to a ratio threshold.
  • the verification results of multiple first voiceprints can be obtained in the first time window, and the ratio of first voiceprints that pass verification can then be determined from those verification results. This ratio can indicate whether the first user's voiceprint has undergone a permanent change, so the first user's reference voiceprint is updated or left unchanged according to it. This avoids mistakenly updating the first user's reference voiceprint because of the contingency of a single first voiceprint's verification result.
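The ratio-based update decision can be sketched as follows; the function name and the default threshold value are hypothetical, and the patent itself does not fix a particular threshold.

```python
def should_update_reference(verification_results, ratio_threshold=0.5):
    """Decide whether to update the reference voiceprint from the boolean
    verification results gathered in the first time window: update when
    the pass ratio (compliance rate) is at or below the ratio threshold,
    which suggests a lasting change in the user's voiceprint."""
    if not verification_results:
        return False
    pass_ratio = sum(verification_results) / len(verification_results)
    return pass_ratio <= ratio_threshold
```

Basing the decision on a window of results, rather than on one failed verification, is what protects against accidental misfires (a cold, background noise, and so on).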
  • updating the reference voiceprint includes: acquiring a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and updating the reference voiceprint according to the second voice signal.
  • that the similarity between the acquired second voice signal and the reference voiceprint meets the preset condition can be understood as follows: during the period from the starting point of the first time window to the time at which the second voice signal is acquired, the change in the first user's voiceprint is less than a change threshold (in other words, the voiceprint is in a stable state or has not changed for a long time), so the reference voiceprint can be updated according to the second voice signal.
  • updating the reference voiceprint according to the second voice signal includes: obtaining a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive; and updating the reference voiceprint according to the deduplicated voice signal.
  • obtaining the deduplicated voice signal according to the second voice signal and its corresponding text includes: performing a deduplication operation on the text corresponding to the second voice signal to obtain the deduplicated text; and obtaining the deduplicated voice signal according to the portion of the second voice signal that corresponds to the deduplicated text.
  • obtaining the deduplicated voice signal according to the second voice signal and its corresponding text includes, for the i-th second voice signal among multiple second voice signals: performing a deduplication operation on the text corresponding to the i-th second voice signal according to the historical deduplicated text; after the deduplication operation, splicing the text corresponding to the i-th second voice signal with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the 1st to i-th second voice signals, where the historical deduplicated text is obtained according to the texts corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtaining the deduplicated voice signal according to the portions of the multiple second voice signals that correspond to the deduplicated text.
  • the second voice signal is deduplicated to obtain the deduplicated voice signal, preventing high-frequency characters or words that appear many times in the signal from affecting the accuracy of the extracted reference voiceprint.
  • the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.
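The incremental text deduplication and the sufficiency check above can be sketched as follows. This is an assumed character-level implementation (natural for Chinese text); the function names and threshold values are illustrative, not from the patent.

```python
def deduplicate_text(new_text, history=""):
    """Character-level deduplication: append to `history` only the
    characters not seen before, yielding non-repetitive text that grows
    as successive second voice signals arrive."""
    seen = set(history)
    out = list(history)
    for ch in new_text:
        if ch not in seen:
            seen.add(ch)
            out.append(ch)
    return "".join(out)

def enough_material(dedup_text, dedup_duration_s,
                    min_chars=10, min_duration_s=3.0):
    # Update the reference voiceprint only when the deduplicated speech
    # is long enough and its text is varied enough.
    return dedup_duration_s > min_duration_s and len(dedup_text) > min_chars
```

The corresponding audio segments for the surviving characters would then be concatenated to form the deduplicated voice signal used for re-enrollment.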
  • the method further includes: sliding the first time window, wherein an end point of the first time window after sliding is later than an end point of the first time window before sliding.
  • the first time window can be slid, the verification results of the first voiceprints in the slid first time window can be obtained, and the ratio of first voiceprints that pass verification in the slid window (that is, the compliance rate of the slid first time window) can be determined from those results. The end point of the slid first time window is later than the end point of the window before sliding, and the interval between the two is shorter than the length of the first time window.
  • the compliance rate of the first time window is calculated dynamically: the compliance rate over the time span (that is, the first time window) is determined at regular intervals, so that a permanent change in the first user's voiceprint can be discovered in time and the first user's registered voiceprint updated.
  • the length of the first time window is variable.
  • a first count threshold can be set; when the number of verification results of first voiceprints in the first time window is less than the first count threshold, the duration of the first time window can be automatically extended until the number of verification results of first voiceprints reaches the first count threshold. Basing the judgment on verification results that reach the first count threshold helps improve the accuracy of identifying a permanent change in the first user's voiceprint.
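The variable-length, sliding first time window can be sketched as a small bookkeeping class; the class name, counts, and step size are illustrative assumptions.

```python
class FirstTimeWindow:
    """Illustrative sliding window over verification results: the window
    extends until it holds at least `min_results` results, the compliance
    rate is then read out, and the window slides forward by `step`."""

    def __init__(self, min_results=20, step=5):
        self.min_results = min_results
        self.step = step
        self.results = []

    def add(self, passed):
        self.results.append(bool(passed))

    def ready(self):
        # The window's duration is effectively extended until enough
        # verification results have accumulated.
        return len(self.results) >= self.min_results

    def compliance_rate(self):
        return sum(self.results) / len(self.results)

    def slide(self):
        # Slide the window: its end point moves later; the oldest results
        # fall out while the remainder still overlaps the previous window.
        self.results = self.results[self.step:]
```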
  • it also includes: sliding the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the second time window after sliding The starting point of the window is earlier than the starting point of the first time window after sliding.
  • in a possible implementation, when the reference voiceprint is updated, the second time window is slid, and the starting point of the slid second time window is no earlier than the time point at which the reference voiceprint is updated.
  • in another possible implementation, when the reference voiceprint is updated, the second time window is slid, and the starting point of the slid second time window is no earlier than the time point at which the second voice signal is acquired.
  • when the reference voiceprint is updated, the second time window can be slid; second voiceprints in the slid second time window are then acquired, the similarity between each second voiceprint and the reference voiceprint is determined, and the average similarity of the slid second time window is obtained. Updating the average similarity of the second time window promptly after updating the reference voiceprint helps to subsequently determine whether the first user's voiceprint undergoes a permanent change again.
  • the length of the second time window is variable.
  • a second count threshold may be set; when the number of similarities between second voiceprints and the reference voiceprint determined in the second time window is less than the second count threshold, the duration of the second time window may be automatically extended until that number reaches the second count threshold. Basing the average on a number of similarities that reaches the second count threshold helps improve the accuracy of the average similarity of the second time window.
  • before the verification result of the first voiceprint in the first time window is obtained, the method further includes: determining the similarities in the second time window according to the reference voiceprint and the second voiceprints of the first user acquired in the second time window; and determining the average similarity of the second time window according to the multiple similarities in the second time window.
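Computing the second-window average similarity can be sketched as below; the function name and the pluggable `similarity_fn` are illustrative assumptions.

```python
def window_average_similarity(second_voiceprints, reference_vp, similarity_fn):
    """Average similarity between the second voiceprints collected in the
    second time window and the reference voiceprint. Only this scalar
    needs to be stored afterwards, not the individual voiceprints."""
    sims = [similarity_fn(vp, reference_vp) for vp in second_voiceprints]
    return sum(sims) / len(sims)
```

Keeping only the scalar average is what enables the storage savings the application mentions: the second voiceprints themselves can be discarded once the average is known.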
  • the present application provides a voiceprint management device, which may be a terminal device, or a component (such as a processing device, circuit, chip, etc.) in the terminal device.
  • the terminal device includes vehicle-mounted devices (such as an in-vehicle head unit or a vehicle-mounted computer) and user terminals (such as a mobile phone or a computer).
  • the device may also be a server, or a component (such as a processing device, a circuit, a chip, etc.) in a server.
  • the server may include physical devices such as hosts or processors, virtual devices such as virtual machines or containers, and may also include chips or integrated circuits.
  • the server is, for example, an Internet of Vehicles (IoV) server, also known as a cloud server, the cloud, or a cloud controller.
  • the IoV server may be a single server or a server cluster composed of multiple servers; this is not specifically limited.
  • the voiceprint management device includes an acquisition module and a processing module. The acquisition module is configured to acquire the verification result of the first voiceprint in the first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window; the first voiceprint is determined according to the first voice signal of the first user acquired in the first time window; the average similarity of the second time window is used to indicate the similarity between the voiceprints of the first user acquired in the second time window and the reference voiceprint; and the starting point of the first time window is later than the starting point of the second time window. The processing module is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  • the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window as follows: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the first voiceprint passes verification.
  • the processing module is specifically configured to: update the reference voiceprint according to the ratio of the verified first voiceprint in the first time window.
  • the processing module is specifically configured to: update the reference voiceprint when the ratio of the verified first voiceprint in the first time window is less than or equal to a ratio threshold.
  • the processing module is specifically configured to: control the acquisition module to acquire the second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and update the reference voiceprint according to the second voice signal.
  • the processing module is specifically configured to: obtain the deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive; and update the reference voiceprint according to the deduplicated voice signal.
  • the processing module is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain the deduplicated text; and obtain the deduplicated voice signal according to the portion of the second voice signal that corresponds to the deduplicated text.
  • the processing module is specifically configured to: for the i-th second voice signal among the multiple second voice signals, perform a deduplication operation on the text corresponding to the i-th second voice signal according to the historical deduplicated text; after the deduplication operation, splice the text corresponding to the i-th second voice signal with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the 1st to i-th second voice signals, where the historical deduplicated text is obtained according to the texts corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtain the deduplicated voice signal according to the portions of the multiple second voice signals that correspond to the deduplicated text.
  • the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.
  • the processing module is further configured to: slide the first time window, wherein an end point of the first time window after sliding is later than an end point of the first time window before sliding.
  • the length of the first time window is variable.
  • the processing module is further configured to: slide the second time window, where the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
  • the processing module is further configured to: determine the similarities in the second time window according to the reference voiceprint and the second voiceprints of the first user acquired in the second time window; and determine the average similarity of the second time window according to the multiple similarities in the second time window.
  • the present application provides a computer program product, including a computer program or instructions; when the computer program or instructions are executed by a computing device, the method in the above first aspect or any possible implementation of the first aspect is realized.
  • the present application provides a computer-readable storage medium, in which a computer program or instructions are stored; when the computer program or instructions are executed by a computing device, the method in the above first aspect or any possible implementation of the first aspect is realized.
  • the present application provides a computing device, including a processor connected to a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the computing device performs the method in the above first aspect or any possible implementation of the first aspect.
  • an embodiment of the present application provides a chip system, including a processor coupled to a memory; the memory is used to store programs or instructions, and when the programs or instructions are executed by the processor, the chip system realizes the method in the above first aspect or any possible implementation of the first aspect.
  • the chip system further includes an interface circuit, configured to transfer code instructions to the processor.
  • there may be one or more processors in the chip system, and a processor may be implemented by hardware or by software.
  • the processor may be a logic circuit, an integrated circuit, or the like.
  • the processor may be a general-purpose processor, implemented by reading software codes stored in memory.
  • the memory can be integrated with the processor, or can be set separately from the processor.
  • the memory may be a non-transitory memory, such as a read-only memory (ROM); it may be integrated with the processor on the same chip, or the two may be provided on different chips.
  • the reference voiceprint of the first user is determined in advance; second voiceprints of the first user are then acquired in the second time window, the similarity between each second voiceprint and the reference voiceprint is determined, and the average similarity of the second time window is determined according to the multiple similarities in the second time window.
  • the second time window follows the time point at which the reference voiceprint is stored; that is, during the period from the time point at which the user registers the reference voiceprint to the end point of the second time window, the user's voiceprint changes steadily, and a reference parameter (the reference voiceprint together with the average similarity of the second time window) can be obtained based on the steadily changing voiceprint of the first user.
  • the similarity between the first voiceprint and the reference voiceprint can then be determined based on the user's first voiceprint in the first time window and compared with the average similarity of the second time window to obtain the verification result of the first voiceprint, that is, to determine whether the first voiceprint in the first time window is similar to the second voiceprints in the second time window, and thus whether the user's voiceprint has undergone a long-term change during the period from the starting point of the second time window to the end point of the first time window.
  • the verification results of multiple first voiceprints in the first time window are determined, and the ratio of first voiceprints that pass verification in the first time window (that is, the compliance rate of the first time window) is obtained. Judging from this compliance rate whether the user's voiceprint has undergone a long-term change avoids misjudgments caused by accidental factors within the first time window and helps improve the accuracy of the judgment.
  • the first time window can be slid to determine the ratio of the verified first voiceprint in the first time window after sliding (that is, the compliance rate in the first time window after sliding) , according to the compliance rate in the first time window after sliding, it is determined whether the voiceprint of the user has undergone a permanent change, which is helpful to timely discover the permanent change of the voiceprint of the first user and update the registered voiceprint of the first user.
  • the second voice signal is acquired and the first user's voiceprint is determined from it without any interaction with the user, so the process is imperceptible to the user, which helps improve the user experience.
  • there is no need to pre-store the multiple first voiceprints of the first time window, which helps reduce the storage overhead.
  • there is no need to store the multiple second voiceprints of the second time window either; only the average similarity of the second time window needs to be saved, further reducing the storage overhead.
  • Fig. 1a is a schematic structural diagram of a speech signal processing system provided by the present application.
  • Fig. 1b is a schematic flow diagram of a semantic understanding process exemplarily provided by the present application.
  • Fig. 1c is a schematic flow chart of a voiceprint extraction process exemplarily provided by the present application.
  • Fig. 2 is a schematic diagram of a vehicle-mounted scene provided by the present application.
  • Fig. 3 is a schematic diagram of another vehicle-mounted scene provided by the present application.
  • Fig. 4 is a schematic display diagram of a mobile phone interface exemplarily provided by the present application.
  • Fig. 5 is a schematic diagram of the correspondence between a voiceprint management process and time provided by the present application.
  • Fig. 6 is a schematic flow diagram of a voiceprint verification process provided by the present application.
  • Fig. 7 is a schematic diagram of the correspondence between another voiceprint management process and time provided by the present application.
  • Fig. 8 is a schematic flow chart of updating a reference voiceprint provided by the present application.
  • Fig. 9 is a schematic diagram of yet another vehicle-mounted scene exemplarily provided by the present application.
  • Fig. 10 is a schematic flow chart of another voiceprint management method exemplarily provided by the present application.
  • Fig. 11 is a schematic structural diagram of a voiceprint management device exemplarily provided by the present application.
  • Fig. 12 is a schematic structural diagram of another voiceprint management device exemplarily provided by the present application.
  • Fig. 1a is a schematic structural diagram of a voice signal processing system exemplarily provided in the present application, the voice signal processing system may at least include a voice collection device, a voiceprint management device and a semantic understanding device.
  • the voice signal processing system can collect the user's voice signal through the voice collection device and input it into the semantic understanding device and the voiceprint management device respectively. The semantic understanding device can perform a semantic extraction process on the voice signal to obtain the machine-recognizable instruction corresponding to it; the voiceprint management device can determine the user's voiceprint feature vector from the voice signal and perform the corresponding recognition process based on that vector.
  • the voice collection device may be a microphone array, which may be composed of a certain number of acoustic sensors (usually microphones).
  • the microphone array can provide one or more of the following functions: speech enhancement, extracting clean speech from a noisy speech signal; sound source localization, using the array to calculate the angle and distance of the target speaker so as to track the speaker and subsequently pick up voice directionally; de-reverberation, reducing the impact of reflected sound; and sound source extraction/separation, separating the individual source signals from mixed sound.
  • Microphone arrays can be applied in complex environments with multiple sound sources, noise, and echoes, such as in vehicles, outdoors, and in supermarkets.
  • the semantic understanding device may sequentially perform the following processing on the speech signal:
  • ASR: automatic speech recognition.
  • the voice signal may be processed as a sound wave.
  • the voice signal is processed in frames to obtain a small segment of waveform corresponding to each frame.
  • the small segment of waveform is transformed into multi-dimensional vector information according to human ear characteristics, wherein the duration of each frame may be about 20ms to 30ms.
  • multiple phonemes (phones) corresponding to the multi-dimensional vector information are decoded, and the phonemes are assembled into words and concatenated into sentences (that is, natural language text) for output.
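The framing step described above (cutting the waveform into roughly 20-30 ms pieces before feature extraction) can be sketched as follows; the function name and the hop length are illustrative assumptions.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into short overlapping frames (each roughly
    20-30 ms), as done before acoustic feature extraction in ASR."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]
```

Each frame would then be converted into a multi-dimensional feature vector (for example MFCCs) for decoding.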
  • NLP Natural language processing
• Semantic slot filling: the structured information obtained from natural language processing is filled into the corresponding slots, so that user intentions can be converted into machine-recognizable user instructions.
• the voiceprint management device may include a voiceprint extraction module, and the voiceprint extraction module may be used to perform voiceprint extraction on voice signals to obtain the corresponding voiceprint feature vector.
  • the voiceprint extraction module may perform pre-processing (also called pre-processing), voiceprint feature extraction and post-processing on the voice signal in sequence.
• Preprocessing: extract the audio feature information from the speech signal.
• one or more of the following operations may be performed on the speech signal: denoising, voice activity detection (VAD), perceptual linear predictive (PLP) analysis, and Mel-frequency cepstral coefficient (MFCC) calculation.
• Voiceprint feature extraction: the audio feature information is input into the voiceprint feature extraction model, which outputs a voiceprint feature vector.
• Voiceprint feature extraction models include but are not limited to one or more of: the Gaussian mixture model (GMM), the joint factor analysis (JFA) model, the i-vector model, the d-vector model, and the x-vector model.
  • the voiceprint feature vector may be referred to as voiceprint for short.
• Post-processing: perform post-processing on the voiceprint output by the voiceprint feature extraction model to obtain the final voiceprint.
• the post-processing may include one or more of the following: linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), and nuisance attribute projection (NAP).
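The three-stage extraction pipeline described above (preprocessing, voiceprint feature extraction, post-processing) can be sketched as follows. This is a toy illustration, not any of the models named above: a crude energy-based VAD stands in for the preprocessing operations, simple per-frame statistics stand in for a trained GMM/i-vector/x-vector model, and length normalization stands in for LDA/PLDA/NAP.

```python
import math

def preprocess(signal, vad_threshold=0.01, frame_len=4):
    """Toy pre-processing: keep only frames whose average energy exceeds
    a threshold (a stand-in for denoising/VAD/PLP/MFCC)."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    return [f for f in frames if sum(x * x for x in f) / len(f) > vad_threshold]

def extract_embedding(frames, dim=3):
    """Toy feature extractor: average per-frame statistics into a fixed-length
    vector (a stand-in for a trained voiceprint feature extraction model)."""
    emb = [0.0] * dim
    for f in frames:
        stats = [sum(f) / len(f), max(f), min(f)]
        emb = [e + s for e, s in zip(emb, stats)]
    n = max(len(frames), 1)
    return [e / n for e in emb]

def postprocess(emb):
    """Toy post-processing: length normalization (a stand-in for LDA/PLDA/NAP)."""
    norm = math.sqrt(sum(x * x for x in emb)) or 1.0
    return [x / norm for x in emb]

# Hypothetical amplitude samples: one silent frame followed by voiced frames.
signal = [0.0, 0.0, 0.0, 0.0, 0.3, 0.5, 0.4, 0.2, 0.35, 0.45, 0.5, 0.1]
voiceprint = postprocess(extract_embedding(preprocess(signal)))
```

The silent frame is dropped by the VAD stage, and the final voiceprint is a unit-length vector, mirroring the common practice of comparing normalized embeddings by cosine similarity.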
  • the corresponding voiceprint can be extracted from the voice signal.
• the voiceprint, like the face, fingerprint, and iris, is a kind of biometric information. According to the voiceprint, the identity of the user who initiated the voice signal can be determined. Compared with face recognition, identifying a user's identity by voiceprint is not restricted by facial occlusion; compared with fingerprint recognition, it is not restricted by physical contact, making it easier to implement.
  • the voiceprint management device may also include a voiceprint recognition module, which can be used to identify the user's identity information according to the voiceprint.
• the voiceprint recognition module pre-stores the user's reference voiceprint, and the voiceprint recognition module can compare the voiceprint from the voiceprint extraction module (which may be called the collected voiceprint) with the reference voiceprint to determine whether the collected voiceprint and the reference voiceprint correspond to the same user.
  • the voiceprint recognition module may determine whether the collected voiceprint and the reference voiceprint correspond to the same user according to the similarity threshold.
  • the voiceprint recognition module may take the similarity between registered voiceprints and collected voiceprints of N users as samples, and determine the similarity threshold based on the samples, where N is a positive integer.
• the voiceprint recognition module can obtain N similarities, one between the registered voiceprint and the collected voiceprint of each of the N users, and then determine the similarity threshold according to the N similarities.
  • the voiceprint recognition module may use the average value or median value of N similarities as the similarity threshold.
  • the similarity threshold may be 0.75.
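As a sketch of deriving the similarity threshold from N samples, using hypothetical per-user similarity values (whose mean happens to reproduce the 0.75 example above):

```python
from statistics import mean, median

# Hypothetical similarities between each of N = 5 users' registered
# voiceprints and collected voiceprints of the same users.
same_user_similarities = [0.82, 0.74, 0.71, 0.78, 0.70]

# Either the average value or the median value can serve as the threshold.
threshold_mean = mean(same_user_similarities)      # 0.75
threshold_median = median(same_user_similarities)  # 0.74
```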
• If the voiceprint recognition module determines that the similarity between the collected voiceprint and the reference voiceprint is greater than the similarity threshold, it can determine that the collected voiceprint and the reference voiceprint correspond to the same user. If it determines that the similarity is less than or equal to the similarity threshold, it can determine that the collected voiceprint and the reference voiceprint correspond to different users.
• A similarity between the collected voiceprint and the reference voiceprint greater than the similarity threshold can be understood as the collected voiceprint matching the reference voiceprint; a similarity less than or equal to the similarity threshold can be understood as the two not matching.
• Alternatively, when the similarity between the collected voiceprint and the reference voiceprint is greater than or equal to the similarity threshold, the voiceprint recognition module may determine that the collected voiceprint and the reference voiceprint correspond to the same user; when the similarity is less than the similarity threshold, it determines that they correspond to different users. In this case, a similarity greater than or equal to the similarity threshold can be understood as the collected voiceprint matching the reference voiceprint, and a similarity less than the similarity threshold as the two not matching.
• Taking the first example above as an example, when multiple reference voiceprints are stored, the reference voiceprint with the greatest similarity to the collected voiceprint among the multiple reference voiceprints can be used as the matching voiceprint of the collected voiceprint; that is, among the multiple reference voiceprints, the one with the greatest similarity to the collected voiceprint is determined to match it, while the other reference voiceprints do not match the collected voiceprint.
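The matching rule over multiple reference voiceprints can be sketched as follows, using cosine similarity (one of the similarity algorithms mentioned later in this document) on hypothetical 3-dimensional voiceprint vectors; the user ids and the 0.75 threshold are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(collected, reference_prints, threshold=0.75):
    """Return the id of the reference voiceprint with the greatest similarity
    to the collected voiceprint, or None if even the best falls at or below
    the similarity threshold (no match)."""
    best_id, best_sim = None, -1.0
    for user_id, ref in reference_prints.items():
        sim = cosine_similarity(collected, ref)
        if sim > best_sim:
            best_id, best_sim = user_id, sim
    return best_id if best_sim > threshold else None

refs = {"user_a": [1.0, 0.1, 0.0], "user_b": [0.0, 1.0, 0.2]}
print(identify([0.9, 0.2, 0.05], refs))  # user_a: greatest similarity, above threshold
```

A collected voiceprint dissimilar to every stored reference voiceprint (e.g. `[0.0, 0.0, 1.0]`) returns `None`, i.e. no reference voiceprint matches.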
  • Voiceprint recognition can be used in the identification process.
• reference voiceprints of one or more users are stored in the voiceprint recognition module. The voiceprint recognition module can compare the collected voiceprint with the stored one or more reference voiceprints, determine from them the reference voiceprint matching the currently collected voiceprint, and then determine the user's identity information according to the determined reference voiceprint.
  • different permissions corresponding to different users may be stored in the voiceprint recognition module.
• After the voiceprint recognition module determines the user's identity information according to the collected voiceprint, it can further determine the user's corresponding authority according to the user's identity information.
  • Voiceprint recognition can also be used in the identity verification process.
  • the reference voiceprint of one or more users who have passed the identity verification is stored in the voiceprint recognition module.
• the voiceprint recognition module can compare the collected voiceprint with the stored one or more reference voiceprints to determine whether there is a reference voiceprint matching the collected voiceprint. If so, it can be determined that the user corresponding to the currently collected voiceprint passes the identity verification; otherwise, it can be determined that the user corresponding to the currently collected voiceprint fails the identity verification.
  • the voiceprint management device in this usage scenario may be a vehicle-mounted device (for example, a car machine, a vehicle-mounted computer, etc.), and the vehicle-mounted device may determine the identity information of the user corresponding to the currently collected voiceprint based on the reference voiceprint.
• the vehicle-mounted device can determine whether the user has passed the identity verification based on the reference voiceprint. Specifically, suppose the vehicle-mounted device provides the query function for "vehicle violation information" only to the vehicle owner. The vehicle-mounted device can store the reference voiceprint a of user A (the vehicle owner) and mark that reference voiceprint a corresponds to the vehicle owner. With reference to the scene shown in Fig. 2(a), when user A wants to query vehicle violation information, user A can say "inquire about vehicle violation information"; at this moment, the vehicle-mounted device can extract the voiceprint in the voice signal "inquire about vehicle violation information", determine that it matches reference voiceprint a, and provide the query result.
• When user B says "query vehicle violation information", the vehicle-mounted device can extract the voiceprint in user B's voice signal, determine that the extracted voiceprint does not match reference voiceprint a, and prompt user B that the query failed, for example by displaying "Only the vehicle owner may query" on the display interface.
  • the vehicle-mounted device can determine different permissions corresponding to different users based on different user identity information. Specifically, car owners have the right to query vehicle violation information, while non-car owners do not have the right to query vehicle violation information.
• When the vehicle-mounted device determines that the user is the vehicle owner, it provides the user with the function of querying vehicle violation information; when it determines that the user is not the vehicle owner, it refuses to provide this function.
• When the vehicle-mounted device determines different permissions for different users according to their identity information, a reference voiceprint corresponding to the driver can also be set in the vehicle-mounted device.
• If the collected voiceprint matches the driver's reference voiceprint, the vehicle-mounted device can determine that the current user is the driver and, correspondingly, provide the current user with the authority corresponding to the driver, for example, controlling the driving of the vehicle through voice signals.
  • the in-vehicle device provides different recommended content for different users.
  • the vehicle-mounted device can store the reference voiceprint a of user A, and record that user A likes rock music; and store the reference voiceprint b of user B, and record that user B likes light music.
• When user A says "turn on the music", the vehicle-mounted device can extract the voiceprint in the voice signal, compare the extracted voiceprint with the stored reference voiceprint a and reference voiceprint b, determine that the extracted voiceprint matches reference voiceprint a, and recommend a rock music list on the display interface.
• Similarly, when user B says "turn on the music", the vehicle-mounted device can extract the voiceprint in the voice signal, compare the extracted voiceprint with the stored reference voiceprint a and reference voiceprint b, determine that the extracted voiceprint matches reference voiceprint b, and recommend a light music list on the display interface.
• When the vehicle-mounted device determines that the extracted voiceprint matches reference voiceprint a, it can also directly play rock music; or, when it determines that the extracted voiceprint matches reference voiceprint b, it can also directly play light music.
  • other implementation manners may also be used, which are not limited in this application.
  • the voiceprint management device may be a user terminal, such as a mobile phone.
• the reference voiceprint of the phone owner is pre-stored in the mobile phone, and the mobile phone can determine whether the collected voiceprint matches the reference voiceprint, and thus whether the user corresponding to the currently collected voiceprint is the owner.
  • the owner user can instruct the mobile phone to unlock and perform corresponding actions through voice signals.
• the voiceprint management device needs to compare the collected voiceprint against a relatively accurate reference voiceprint and then perform corresponding actions based on the comparison result (match or no match). That is, the accuracy of the reference voiceprint stored in the voiceprint management device affects the accuracy of voiceprint recognition.
  • the user's voiceprint may undergo short-term or long-term changes.
  • Short-term changes refer to reversible changes in the user's voiceprint due to temporary external stimuli, such as reversible changes in the user's voiceprint caused by a cold.
• Long-term changes refer to irreversible changes in the user's voiceprint caused by the user's physiological changes.
• the voiceprint management device needs to update the user's reference voiceprint based on a voiceprint that has undergone a long-term change, rather than based on a voiceprint that has undergone only a short-term change, so as to improve the accuracy of the reference voiceprint and thus the accuracy of voiceprint recognition.
  • the user's reference voiceprint is updated.
  • the present application exemplarily provides a voiceprint management method, which can be executed by a voiceprint management device.
  • the voiceprint management apparatus may be the voiceprint management device exemplarily shown in FIG. 1a.
  • the voiceprint management apparatus may be a terminal device, or a component (such as a processing device, a circuit, a chip, etc.) in the terminal device.
  • the terminal equipment includes vehicle-mounted equipment (such as a car machine, a vehicle-mounted computer, etc.), and a user terminal (such as a mobile phone, a computer, etc.).
  • the voiceprint management device may be a server, or a component (such as a processing device, circuit, chip, etc.) in the server.
  • the server may include physical devices such as hosts or processors, virtual devices such as virtual machines or containers, and may also include chips or integrated circuits.
• the server is, for example, an Internet of Vehicles (IoV) server, which may also be called a cloud server, the cloud, or a cloud controller.
• the IoV server may be a single server or a server cluster composed of multiple servers, which is not specifically limited.
• In the voiceprint management method, based on the reference voiceprint and the reference similarity, it can be determined whether the voiceprint collected in the current time window has passed verification, and then whether the user's reference voiceprint needs to be updated.
• The voiceprint management method is analyzed below in the order of three processes: obtaining the reference voiceprint, obtaining the reference similarity, and voiceprint verification.
• The following three processes are all directed at the same user (who may also be called the registrant, the speaker, etc.): the user's reference voiceprint and reference similarity are obtained, the user's voiceprint is verified, and it is determined whether the user's reference voiceprint needs to be updated.
  • the same user can be determined based on technologies such as voiceprint comparison and face recognition.
• The similarity threshold can be used to indicate whether two voiceprints correspond to the same user. It can be understood that, regardless of whether the user's voiceprint changes, the similarity between the user's collected voiceprint and the user's reference voiceprint is greater than the similarity threshold, while the similarity between voiceprints of different users is less than the similarity threshold.
  • the reference voiceprint registered by user A is reference voiceprint a
  • the reference voiceprint registered by user B is reference voiceprint b
  • the similarity between reference voiceprint a and reference voiceprint b is less than the similarity threshold.
• If the similarity between a collected voiceprint and reference voiceprint a is greater than the similarity threshold while its similarity with reference voiceprint b is less than the similarity threshold, it can be determined that the collected voiceprint corresponds to user A rather than user B.
• To ensure that the collected voice signal is the user's voice signal, and that the voiceprint extracted from it is the user's voiceprint, it can be determined based on the user's mouth shape that the user is speaking. For example, the user's mouth can open and/or close according to a preset rule; if a voice signal corresponding to the preset rule is collected, it can be determined that the current user is speaking and that the acquired voice signal was sent by the user.
  • the user may also be identified in other ways.
  • a user account can be set, and when the user logs in through the account and sends a voice signal, it can be determined that the obtained voice signal and the user corresponding to the account login are the same user.
  • the account number can also be replaced by the user's fingerprint.
• When the user sends a voice signal after logging in through the fingerprint, it can be determined that the obtained voice signal and the user corresponding to the fingerprint login are the same user.
  • one or more of the above-mentioned voiceprint comparison, face recognition, and account verification can also be combined to determine the same user, so as to improve the accuracy of determining the same user.
  • the same user targeted is referred to as the first user.
  • the first user may register a voiceprint at a registration time point (t0 as shown in FIG. 5(a)), where the voiceprint registered by the first user may be called a registered voiceprint or a reference voiceprint.
  • a preset text may be displayed on the display interface, and the first user reads the preset text to obtain the current voice signal of the first user, and extracts the voiceprint based on the voice signal to obtain the first user's voiceprint.
  • the reference voiceprint of the first user is stored.
• this application can further obtain the reference similarity within a second preset duration after the registration time point.
• The reference similarity and the reference voiceprint can be used together as reference parameters to determine whether the reference voiceprint needs to be updated; for details, refer to the description of the voiceprint verification process below.
  • the period after the registration time point and corresponding to the second preset duration may be referred to as a reference time period, a reference time window, a second time window, and the like.
  • the second time window can perform a sliding operation under specific circumstances, and the actual duration of the second time window is variable.
• The voice signal of the first user in the second time window (t0-t1 as shown in Figure 5(a)) can be acquired, and a voiceprint is extracted from the voice signal; this voiceprint is the voiceprint of the first user collected in the second time window (and may be referred to as the second voiceprint).
  • the second voiceprint can be compared with the reference voiceprint.
  • the similarity between the second voiceprint and the reference voiceprint can be determined according to a similarity algorithm (such as cosine similarity algorithm, Euclidean distance algorithm, etc.).
  • multiple second voiceprints may be collected in the second time window, that is, similarities between the multiple second voiceprints and the reference voiceprint may be determined respectively.
  • an average value among the multiple similarities may be used as a reference similarity corresponding to the second time window (which may be referred to as an average similarity).
  • the median of the multiple similarities, or the average value between the maximum value and the minimum value may also be used as the reference similarity degree corresponding to the second time window, or in other ways.
  • the average similarity corresponding to the second time window is used as an example for illustration.
  • the average similarity can be replaced by the benchmark similarity to represent the same meaning.
• A second number threshold can be preset. When the number of second voiceprints collected in the second time window of the second preset duration is less than the second number threshold, the duration of the second time window may be automatically extended until the number of second voiceprints in the second time window reaches the second number threshold.
• For example, if the second number threshold is 10 but only 8 second voiceprints are collected within t0-t1 shown in Figure 5(a), the duration of the second time window can be automatically extended; for example, if the 10th second voiceprint is not collected until t1', it can be determined that the end point of the second time window is t1', that is, the second time window is t0-t1'.
  • the average similarity in the second time window may be determined based on the similarities between the 10 second voiceprints and the reference voiceprint respectively.
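The window-extension rule described above (which also applies later to the first time window) can be sketched as a pure function over hypothetical voiceprint collection timestamps, measured in days:

```python
def window_end(collection_times, start, planned_end, count_threshold):
    """Return the actual end of a time window: the planned end if at least
    count_threshold voiceprints were collected by then, otherwise the time
    at which the count_threshold-th voiceprint since `start` arrived."""
    in_window = sorted(t for t in collection_times if t >= start)
    if len(in_window) < count_threshold:
        raise ValueError("count threshold never reached")
    nth = in_window[count_threshold - 1]  # time of the threshold-th voiceprint
    return planned_end if nth <= planned_end else nth

# Only 8 voiceprints arrive by the planned end (day 7); the 10th arrives
# on day 9, so the window stretches from t0-t1 to t0-t1'.
times = [0.5, 1, 2, 3, 4, 5, 6, 6.5, 8, 9, 11]
print(window_end(times, start=0, planned_end=7, count_threshold=10))  # 9
```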
• multiple voiceprints of the first user in the verification time period may be acquired, and based on the acquired voiceprints it is determined whether the first user's voiceprint has undergone a long-term change, that is, whether it is necessary to update the first user's reference voiceprint based on the voiceprint after the long-term change.
  • the starting point of the first time window is later than the starting point of the second time window, and the first time window may partially overlap or not overlap with the second time window.
  • the first time window may be set as a first preset duration, and the first preset duration may be equal to or different from the second preset duration. Further, the first time window may perform a sliding operation under specific circumstances, and the actual duration of the first time window is variable.
  • FIG. 6 exemplarily shows a schematic flow diagram of voiceprint verification, in this flow:
  • Step 601 acquire a first speech signal in a first time window.
• The first voice signal of the first user in the first time window (t1-t2 as shown in Figure 5(a)) can be obtained, and according to the voiceprint extraction process exemplarily shown in Figure 1c, the first voiceprint of the first user is extracted from the first voice signal.
  • Step 602 extracting the first voiceprint from the first voice signal, and determining the verification result of the first voiceprint.
• The first voiceprint can be compared with the reference voiceprint to determine the verification result of the first voiceprint. Specifically, the similarity between the first voiceprint and the reference voiceprint can be determined; when the similarity is greater than the average similarity corresponding to the second time window, it is recorded that the first voiceprint has passed the verification, and when the similarity is not greater than the average similarity corresponding to the second time window, it is recorded that the first voiceprint has failed the verification.
• A bit can be used to record whether the corresponding first voiceprint passes the verification; for example, when the first voiceprint passes the verification, the value of the corresponding bit is recorded as 1, and when the first voiceprint fails the verification, the value of the corresponding bit is recorded as 0.
• In this way, the plurality of first voiceprints need not be stored; only the verification results of the plurality of first voiceprints need to be stored, and each verification result can occupy one bit, which helps reduce storage space.
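A minimal sketch of storing each verification result in a single bit, assuming the results are indexed in collection order:

```python
def record_result(bitmap, index, passed_flag):
    """Store the pass/fail result of the index-th first voiceprint in one bit."""
    return bitmap | (1 << index) if passed_flag else bitmap & ~(1 << index)

def passed(bitmap, index):
    """Read back whether the index-th first voiceprint passed verification."""
    return (bitmap >> index) & 1 == 1

bitmap = 0
results = [True, True, True, False, True]  # verification outcomes, e.g. 1,1,1,0,1
for i, r in enumerate(results):
    bitmap = record_result(bitmap, i, r)
print(bin(bitmap))  # 0b10111
```

Five results fit in five bits instead of five stored voiceprint vectors, which is the storage saving the text describes.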
  • Step 603 Determine whether to update the reference voiceprint according to the verification result of the first voiceprint in the first time window. If no update is required, the voiceprint verification process can be performed again. If an update is required, the voiceprint update process can be performed.
  • Multiple first voiceprints may be collected in the first time window, that is, verification results of multiple first voiceprints may be determined. Whether to update the reference voiceprint of the first user may be determined based on the verification results of multiple first voiceprints.
• the proportion of first voiceprints that passed verification (which may be referred to as the compliance rate) may be counted, that is, the ratio of the number of first voiceprints that passed verification to the total number of first voiceprints.
• If the ratio is greater than the ratio threshold, the current voiceprint of the first user has not undergone a long-term change, and there is no need to update the first user's reference voiceprint. If the ratio is less than or equal to the ratio threshold, the current voiceprint of the first user has undergone a long-term change, and the first user's reference voiceprint needs to be updated.
  • the ratio threshold is 70%.
• For example, a total of 5 first voiceprints are obtained in the first time window, and the similarities between the 5 first voiceprints and the reference voiceprint are determined to be, for example, 0.90, 0.90, 0.90, 0.80, and 0.86. The verification results of the five first voiceprints are then 1, 1, 1, 0, and 1 respectively, so the proportion of first voiceprints that passed verification in the first time window is 80% (greater than the 70% ratio threshold), and there is no need to update the first user's reference voiceprint.
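The decision rule can be sketched as follows; the verification results 1, 1, 1, 0, 1 reproduce the example above (which implies, without stating, an average similarity from the second time window somewhere between 0.80 and 0.86):

```python
def should_update(verification_results, ratio_threshold=0.70):
    """Decide whether the reference voiceprint needs updating: update when
    the pass rate within the window is at or below the ratio threshold."""
    pass_rate = sum(verification_results) / len(verification_results)
    return pass_rate <= ratio_threshold, pass_rate

# Similarities 0.90, 0.90, 0.90, 0.80, 0.86 yield verification results
# 1, 1, 1, 0, 1, i.e. a pass rate of 80% -- above the 70% threshold,
# so no update is needed.
update, rate = should_update([1, 1, 1, 0, 1])
print(update, rate)  # False 0.8
```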
  • the first number threshold can be preset, and when it is determined that the number of verification results of the first voiceprint in the first time window of the first preset duration is less than the first number threshold , the duration of the first time window may be automatically extended until it is determined that the number of verification results of the first voiceprint in the first time window reaches the first number threshold.
• For example, if the first number threshold is 10 but only 8 first voiceprints are collected within t1-t2 shown in Figure 5(a), the duration of the first time window can be automatically extended; for example, if the 10th first voiceprint is not collected until t2', it can be determined that the end point of the first time window is t2', that is, the first time window is t1-t2'. Further, whether to update the reference voiceprint may be determined based on the verification results of the 10 first voiceprints.
• The average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint.
  • the average similarity in the first time window may be used for subsequent updating of the reference voiceprint, for details, please refer to the following embodiments.
  • Step 604 slide the first time window.
  • the first time window may be slid backward for a third preset duration, and the third preset duration may be shorter than the first preset duration, or may be shorter than the second preset duration.
  • the end point of the first time window after sliding is later than the end point of the first time window before sliding.
  • the first time window slides from t1-t2 before sliding to t3-t4, where the interval between t2 and t4 is the third preset duration.
  • the first preset duration may be 7 days
  • the second preset duration may be 7 days
  • the third preset duration may be 1 day.
• The duration of the first time window may be extended until the number of verification results of the first voiceprint in the first time window reaches the first number threshold; that is, the duration of the first time window is variable, and there may be a situation where the duration of the first time window before sliding is greater than the first preset duration.
• In this case, the end point of the first time window can be slid backward by the third preset duration, and then the starting point of the first time window after sliding can be determined from the end point of the slid window and the first preset duration.
• For example, the end point of the first time window is extended to t2' (where the interval between t1 and t2' is greater than the first preset duration).
  • the time point corresponding to t2'+the third preset duration can be used as the end point of the first time window after sliding, for example, it is expressed as t4'.
  • it can be determined that the starting point of the first time window after sliding is t4'-the first preset duration, for example, it is expressed as t3'.
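The endpoint arithmetic for sliding an extended window can be sketched as follows (times in days; the 7-day first preset duration and 1-day third preset duration follow the earlier example):

```python
def slide_window(prev_end, first_preset, third_preset):
    """After a window ending at prev_end (possibly extended beyond its
    preset length), the slid window ends third_preset later (t4' = t2' +
    third preset duration) and starts first_preset before that end
    (t3' = t4' - first preset duration)."""
    new_end = prev_end + third_preset
    new_start = new_end - first_preset
    return new_start, new_end

# The extended window ended on day 9 (t2'), so the slid window is t3'-t4'.
print(slide_window(9, 7, 1))  # (3, 10)
```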
  • the first voiceprint of the first user may be continuously collected in the first time window after sliding, the similarity between the collected first voiceprint and the reference voiceprint is determined, and then the verification result of the first voiceprint is determined.
  • the verification results of multiple first voiceprints may be determined in the first time window after sliding, and it is determined whether the reference voiceprint of the first user needs to be updated according to the verification results of the multiple first voiceprints. That is, after step 604, the above steps 601 to 603 may be continued until the reference voiceprint of the first user is updated.
• The first time window may gradually slide from t1-t2 to t5-t6. Further, when the first time window slides to t5-t6, if it is determined according to the verification results of the multiple first voiceprints obtained in t5-t6 that the voiceprint update process needs to be executed, steps 605 and 606 (the voiceprint update process) can be started at t6.
  • Step 605 acquire the second voice signal, wherein the difference between the similarity between the second voice signal and the reference voiceprint and the average similarity in the first time window is less than the difference threshold.
  • the voice signal of the first user can be collected to determine whether the voice signal meets the preset condition, and if so, the voice signal is used to update the reference voiceprint, otherwise, the voice signal is discarded.
  • the voiceprint is determined according to the collected voice signal, and the similarity between the voiceprint and the reference voiceprint is determined. If the difference between the similarity and the average similarity in the first time window is smaller than the difference threshold, the collected voice signal (ie, the second voice signal) may be used to update the reference voiceprint. If the difference between the similarity and the average similarity in the first time window is not less than the difference threshold, the collected speech signal may be discarded.
  • the difference threshold is 0.1.
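The filtering rule of step 605 can be sketched as a single predicate; the 0.72 window average similarity used in the usage lines is a hypothetical value.

```python
def keep_for_update(similarity, window_avg_similarity, diff_threshold=0.1):
    """Keep a collected voice signal for updating the reference voiceprint
    only when its similarity to the reference voiceprint is within
    diff_threshold of the first time window's average similarity."""
    return abs(similarity - window_avg_similarity) < diff_threshold

print(keep_for_update(0.78, 0.72))  # True: difference 0.06 < 0.1, signal kept
print(keep_for_update(0.95, 0.72))  # False: difference 0.23, signal discarded
```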
  • Step 606 Update the reference voiceprint according to the second voice signal.
  • one or more second voice signals may be obtained after the voiceprint update process is started.
• a deduplication operation, or a deduplication operation and a splicing operation, may be performed on the one or more second voice signals to obtain the voice signal after the deduplication operation (or after the deduplication and splicing operations), and then the voiceprint is extracted therefrom.
• ASR may be performed on the second speech signal to obtain the text corresponding to the second speech signal.
• A deduplication operation is performed on the text corresponding to the second speech signal to obtain text without repeated content, which may be referred to as deduplicated text.
• Correspondingly, a voice signal after the deduplication operation (which may be called a deduplicated voice signal) is obtained.
  • An update condition can be preset; when the deduplicated voice signal meets the update condition, the reference voiceprint can be updated according to the deduplicated voice signal.
  • For example, the update condition may be that the duration of the deduplicated speech signal is greater than the duration threshold, and/or that the number of characters in the non-repetitive text (i.e., the deduplicated text) corresponding to the deduplicated voice signal is greater than the word-count threshold.
  • In the following description, the update condition is that the duration of the deduplicated speech signal is greater than the duration threshold and that the word count of the deduplicated text corresponding to the deduplicated speech signal is greater than the word-count threshold.
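Assuming the combined condition just described (both thresholds must be exceeded), the check can be sketched as a small predicate; the 5 s / 12-word default values follow the worked example given later in the text:

```python
def meets_update_condition(duration_s: float, word_count: int,
                           duration_threshold_s: float = 5.0,
                           word_count_threshold: int = 12) -> bool:
    """Update condition on the deduplicated voice signal: its duration must
    exceed the duration threshold AND the word count of its non-repetitive
    text must exceed the word-count threshold."""
    return duration_s > duration_threshold_s and word_count > word_count_threshold

print(meets_update_condition(5.5, 16))  # both thresholds exceeded
print(meets_update_condition(1.5, 4))   # neither exceeded
```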
  • For example, the process of updating the reference voiceprint is started at t6, and one or more second voice signals are then collected until, at t7, a deduplicated voice signal that meets the update condition is obtained; the reference voiceprint is updated at t7 according to this deduplicated voice signal.
  • t6-t7 may be understood as a third time window, and the third time window is used to acquire one or more second voice signals.
  • Steps 605 and 606 are explained below with reference to the schematic flowchart of updating a reference voiceprint provided in FIG. 8.
  • Step 801: acquire the 1st second voice signal.
  • Step 802: determine, according to the 1st second voice signal, the deduplicated voice signal corresponding to the 1st second voice signal (the corresponding deduplicated text may be referred to as the first deduplicated text).
  • For example, the first user wants to wake up a certain device, and the wake-up word of the device is "Little A".
  • The first user can utter the second voice signal "Little A, Little A".
  • ASR processing is performed on the second speech signal to obtain the text corresponding to the second speech signal, that is, "Little A, Little A".
  • the deduplication operation may be performed, and the obtained first deduplicated text is "Little A”.
  • For another example, the first user sends a second voice signal "Little A, Little A, please turn on the air conditioner".
  • The first deduplicated text obtained from this second voice signal is "Little A, please turn on the air conditioner".
  • For yet another example, the first user sends the second voice signal "Hello, Little A", or "Little A, please turn on the air conditioner", or "Little A, please turn on the music", etc.
  • the ASR is performed on the second speech signal, it is determined that there is no repeated text in the text corresponding to the second speech signal, and the text corresponding to the second speech signal is directly used as the first deduplication text.
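A minimal sketch of the text deduplication operation, assuming the ASR result has already been segmented into phrases (a real system would choose its own deduplication granularity):

```python
def dedup_phrases(phrases: list[str]) -> list[str]:
    """Order-preserving removal of repeated phrases, e.g. a doubled wake-up
    word. Returns the deduplicated text as a phrase list."""
    seen: set[str] = set()
    result: list[str] = []
    for p in phrases:
        if p not in seen:
            seen.add(p)
            result.append(p)
    return result

print(dedup_phrases(["Little A", "Little A", "please turn on the air conditioner"]))
# -> ['Little A', 'please turn on the air conditioner']
```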
  • The speech signal corresponding to the first deduplicated text within the 1st second speech signal can be determined, so as to obtain the speech signal left after the deduplication operation is performed on the 1st second speech signal (that is, the first deduplicated speech signal).
  • For example, the voice signal segments in the 1st second voice signal that correspond to the first deduplicated text "Little A, please turn on the air conditioner" can be determined.
  • In combination with Table 1, the speech signal segment 11 corresponding to "Little A (the first)" and the speech signal segment 13 corresponding to "please turn on the air conditioner" are concatenated to obtain the first deduplicated speech signal.
  • Since the speech signal segment in the 1st second speech signal corresponding to "Little A" in the first deduplicated text can be either speech signal segment 11 or speech signal segment 12, when splicing, either segment 11 and segment 13 can be concatenated to obtain the first deduplicated speech signal, or segment 12 and segment 13 can be concatenated to obtain it.
  • The above is only an example of how to obtain the deduplicated voice signal from the second voice signal according to the deduplicated text.
  • the second voice signal may also be as shown in Table 2.
  • the second voice signal may also be divided into other manners, which are not limited in this application.
  • Table 2:
      Text              | Speech signal segment
      Little A (first)  | Speech signal segment 11
      Little A (second) | Speech signal segment 12
      please turn on    | Speech signal segment 14
      air conditioner   | Speech signal segment 15
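Using the Table 1 segmentation as assumed input, the splicing step (pick one matching speech-signal segment per deduplicated phrase; the first match is chosen here, though the text notes either duplicate segment would do) could look like:

```python
def splice_segments(dedup_text: list[str],
                    segments: list[tuple[str, str]]) -> list[str]:
    """segments is a list of (segment_id, phrase) pairs.
    For each phrase of the deduplicated text, the first not-yet-used
    matching segment is chosen; the chosen ids are kept in order, modelling
    the concatenation into the deduplicated speech signal."""
    chosen: list[str] = []
    for phrase in dedup_text:
        for seg_id, seg_phrase in segments:
            if seg_phrase == phrase and seg_id not in chosen:
                chosen.append(seg_id)
                break
    return chosen

table1 = [("segment 11", "Little A"), ("segment 12", "Little A"),
          ("segment 13", "please turn on the air conditioner")]
print(splice_segments(["Little A", "please turn on the air conditioner"], table1))
# -> ['segment 11', 'segment 13']
```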
  • Step 803: if it is determined that the first deduplicated speech signal meets the update condition, step 807 is executed; if it is determined that the first deduplicated speech signal does not meet the update condition, step 804 is performed.
  • Step 804: acquire the 2nd second voice signal.
  • Step 805: determine, according to the 2nd second voice signal and the first deduplicated text, the deduplicated voice signal corresponding to the 1st and 2nd second voice signals.
  • The first deduplicated text may be used as the historical deduplicated text for the 2nd second speech signal.
  • ASR processing is performed on the 2nd second speech signal to obtain its corresponding text, and a deduplication operation is performed on that text according to the historical deduplicated text (the first deduplicated text), to obtain the deduplicated text corresponding to the 1st and 2nd second voice signals (which may be referred to as the second deduplicated text).
  • For example, the first deduplicated text is "Little A, please turn on the air conditioner".
  • The 2nd second voice signal is "Little A, Little A, turn the air conditioner higher". ASR is executed on the 2nd second voice signal to obtain the text "Little A, Little A, turn the air conditioner higher"; performing the deduplication operation against the historical deduplicated text leaves the new text "turn the air conditioner higher".
  • The second deduplicated text is therefore "Little A, please turn on the air conditioner higher".
  • Alternatively, the text corresponding to the 2nd second speech signal may first be spliced onto the first deduplicated text, and the deduplication operation is then performed on the spliced text to obtain the second deduplicated text.
  • For example, the first deduplicated text is "Little A, please turn on the air conditioner".
  • The 2nd second voice signal is "Little A, Little A, turn the air conditioner higher".
  • ASR is performed on the 2nd second voice signal to obtain the text "Little A, Little A, turn the air conditioner higher".
  • The speech signal segments corresponding to the second deduplicated text can be determined in the 1st second speech signal and in the 2nd second speech signal, so as to obtain the voice signal after the 1st and 2nd second voice signals are deduplicated and spliced (that is, the second deduplicated voice signal).
  • For example, the 1st second voice signal above is "Little A, Little A, please turn on the air conditioner".
  • The 2nd second voice signal is "Little A, Little A, turn the air conditioner higher".
  • Table 1 shows the correspondence between "Little A, Little A, please turn on the air conditioner" and the speech signal segments in the 1st second speech signal.
  • Table 3 shows the correspondence between "Little A, Little A, turn the air conditioner higher" and the speech signal segments in the 2nd second speech signal.
  • Specifically, the corresponding voice signal segments in the 1st second voice signal and in the 2nd second voice signal can be selected according to the second deduplicated text to determine the second deduplicated speech signal.
  • For example, the speech signal segment 13 corresponding to "please turn on the air conditioner" and the speech signal segment 24 corresponding to "higher" are spliced to obtain the second deduplicated voice signal.
  • Alternatively, the second deduplicated voice signal may be formed by concatenating speech signal segment 12, speech signal segment 13, and speech signal segment 24.
  • After the second deduplicated speech signal is acquired, it may be determined whether it meets the update condition (that is, the judgment of step 803 is performed again); if so, step 807 is performed.
  • If not, the 3rd second voice signal is acquired, a deduplication operation is performed on the text corresponding to the 3rd second voice signal according to the historical deduplicated text (the second deduplicated text), and the deduplicated text corresponding to the 1st to 3rd second voice signals (which may be referred to as the third deduplicated text) is obtained.
  • A deduplicated speech signal (which may be referred to as the third deduplicated speech signal) is then determined from the 1st to 3rd second speech signals, and it is determined whether the third deduplicated voice signal meets the update condition.
  • More generally, step 806 can be performed to obtain the i-th second voice signal, and to determine, according to the i-th second voice signal and the historical deduplicated text, the deduplicated voice signal corresponding to the 1st to i-th second voice signals.
  • For example, in combination with the table below, the first deduplicated text is "Hello, Little A"; its word count is 4, and the duration of the first deduplicated voice signal is 1.5s.
  • Since the word count of the first deduplicated text is not more than 12, and the duration of the first deduplicated voice signal is not more than 5s, the 2nd second voice signal is further obtained.
  • This continues until the sixth deduplicated speech signal meets the update condition, and step 807 is performed according to the sixth deduplicated speech signal.
  • No. | Second voice signal             | Deduplicated text                                                             | Word count | Duration
    1   | Hello Little A                  | Hello Little A                                                                | 4          | 1.5s
    2   | turn on the air conditioner     | Hello Little A, turn on the air conditioner                                   | 8          | 3s
    3   | turn the air conditioner higher | Hello Little A, turn on the air conditioner higher                            | 10         | 3.5s
    4   | Hello Little A                  | Hello Little A, turn on the air conditioner higher                            | 10         | 3.5s
    5   | open the sunroof                | Hello Little A, turn on the air conditioner higher, open the sunroof          | 12         | 4s
    6   | play music                      | Hello Little A, turn on the air conditioner higher, open the sunroof, play music | 16      | 5.5s
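The iteration just tabulated can be replayed as a loop. The per-phrase word counts and durations below are illustrative values chosen so the running totals match the table, with the 12-word / 5 s thresholds of the running example:

```python
def accumulate_until_update(incoming, duration_threshold_s=5.0,
                            word_count_threshold=12):
    """Each incoming (phrase, word_count, duration) triple is merged into the
    history unless the phrase is already present (the deduplication step);
    the update condition is re-checked after every merge. Returns the totals
    at the moment the condition is first met, or None if it never is."""
    seen, words, duration = set(), 0, 0.0
    for phrase, wc, dur in incoming:
        if phrase not in seen:
            seen.add(phrase)
            words += wc
            duration += dur
        if words > word_count_threshold and duration > duration_threshold_s:
            return words, duration
    return None

rows = [("Hello Little A", 4, 1.5), ("turn on the air conditioner", 4, 1.5),
        ("turn the air conditioner higher", 2, 0.5), ("Hello Little A", 4, 1.5),
        ("open the sunroof", 2, 0.5), ("play music", 4, 1.5)]
print(accumulate_until_update(rows))  # condition first met at the sixth signal: (16, 5.5)
```

Note how the repeated "Hello Little A" at row 4 adds nothing, exactly as in the table.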
  • Step 807: determine the third voiceprint of the first user according to the deduplicated voice signal that meets the update condition.
  • For example, the voiceprint extraction process exemplarily shown in FIG. 1c is executed on the deduplicated voice signal that meets the update condition, to obtain the third voiceprint of the first user.
  • Step 808: update the reference voiceprint of the first user according to the third voiceprint.
  • the reference voiceprint of the first user may be actively updated, that is, the original reference voiceprint is replaced by the third voiceprint.
  • the first user may be prompted whether to update the reference voiceprint.
  • For example, the first user is user A (the car owner). When the vehicle-mounted device determines that the voiceprint of user A has changed permanently, it can prompt user A in the display interface whether to automatically update the reference voiceprint.
  • For example, the in-vehicle device displays a prompt message "It is detected that your voiceprint has changed. Do you want to update the detected voiceprint to replace the original voiceprint?" on the display interface. If user A clicks "OK", the in-vehicle device can replace the original reference voiceprint with the obtained third voiceprint. If user A clicks "No, I want to update by myself", the vehicle-mounted device can further display a preset text on the display interface, prompt user A to read the preset text aloud, obtain the current voice signal of user A, extract a voiceprint from that voice signal to obtain the new reference voiceprint of user A, and store it.
  • the user A may be prompted on the display interface to update the reference voiceprint by himself.
  • the second time window can be slid to obtain a slid second time window.
  • the starting point of the second time window after sliding is not earlier than the time point of updating the reference voiceprint; or, the starting point of the second time window after sliding is not earlier than the time point of acquiring the second voice signal.
  • the starting point of the second time window may be after the ending point of the third time window, or coincide with the ending point of the third time window. As shown in (e) of FIG. 5 , the starting point of the second time window after sliding coincides with the ending point of the third time window.
  • After the reference voiceprint is updated, one or more second voiceprints may further be obtained in the slid second time window, and the average similarity in the slid second time window is determined according to the similarities between the one or more second voiceprints and the updated reference voiceprint.
  • the first time window is slid, and the end point of the slid first time window is later than the start point of the slid second time window.
  • the starting point of the first time window after sliding may be after the ending point of the second time window after sliding, or coincide with the ending point of the second time window after sliding.
  • the starting point of the first time window after sliding coincides with the ending point of the second time window after sliding.
  • the first time window after sliding is specifically t8-t9, wherein the interval between t8-t9 is a first preset time length.
  • One or more first voiceprints are obtained in the slid first time window; the verification results of the one or more first voiceprints are determined according to the similarities between those first voiceprints and the updated reference voiceprint and the average similarity of the slid second time window, and it is then determined whether to update the reference voiceprint again.
  • Specifically, the voice signal of the first user is acquired and pre-processed to obtain audio feature information of the voice signal. Audio feature extraction is performed on this information to determine the voiceprint of the first user, and post-processing is performed to obtain the final voiceprint of the first user. In one case, the voiceprint of the first user is registered or updated as the reference voiceprint. In another case, it may be determined that the current time point falls within the first time window, or within the second time window.
  • If the current time point falls within the first time window, the voiceprint of the first user (that is, the first voiceprint) is compared with the reference voiceprint to obtain the similarity between the first voiceprint and the reference voiceprint; if this similarity is greater than the average similarity in the second time window, it is determined that the first voiceprint has passed verification.
  • The ratio of verified first voiceprints in the first time window (i.e., the compliance rate in the first time window) is then determined.
  • If the compliance rate is greater than the ratio threshold, the first time window is slid; if the compliance rate is less than or equal to the ratio threshold, the process of updating the reference voiceprint is started.
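The decision logic of this step, with the greater-than / less-than-or-equal split as described (the concrete ratio threshold value is an assumption for illustration):

```python
def should_start_update(verifications: list[bool],
                        ratio_threshold: float = 0.5) -> bool:
    """Compliance rate = fraction of first voiceprints in the first time
    window that passed verification. At or below the ratio threshold the
    reference-voiceprint update process starts; above it, the first time
    window simply slides onward."""
    compliance_rate = sum(verifications) / len(verifications)
    return compliance_rate <= ratio_threshold

print(should_start_update([True, False, False, False]))  # low compliance -> start update
print(should_start_update([True, True, True, False]))    # high compliance -> slide window
```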
  • If the current time point falls within the second time window, the voiceprint of the first user (that is, the second voiceprint) is compared with the reference voiceprint to obtain the similarity between the second voiceprint and the reference voiceprint.
  • The average of the similarities between the multiple second voiceprints and the reference voiceprint is determined as the average similarity.
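The average similarity of the second time window is a plain mean over per-voiceprint similarities. The cosine scorer here is an illustrative stand-in for whatever similarity measure the system actually uses:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def window_average_similarity(second_voiceprints, reference, score=cosine):
    """Mean similarity between each second voiceprint obtained in the
    second time window and the reference voiceprint."""
    sims = [score(v, reference) for v in second_voiceprints]
    return sum(sims) / len(sims)

ref = (1.0, 0.0)
prints = [(1.0, 0.0), (0.0, 1.0)]  # one identical, one orthogonal
print(window_average_similarity(prints, ref))  # (1.0 + 0.0) / 2 = 0.5
```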
  • Various thresholds are involved in this application, such as the ratio threshold, similarity threshold, difference threshold, times threshold, duration threshold, and word-count threshold.
  • For any of these thresholds, being greater than or equal to the threshold may correspond to the first result and being less than the threshold to the second result; alternatively, being greater than the threshold may correspond to the first result and being less than or equal to the threshold to the second result. This is not limited in this application.
  • Taking the ratio threshold as an example: when the ratio is greater than the ratio threshold, there is no need to update the first user's reference voiceprint (i.e., the first result); when the ratio is less than or equal to the ratio threshold, the first user's reference voiceprint is updated (i.e., the second result). However, it is equally possible in this application that when the ratio is greater than or equal to the ratio threshold there is no need to update the reference voiceprint (the first result), and when the ratio is less than the ratio threshold the reference voiceprint is updated (the second result).
  • The semantic understanding device may execute ASR to obtain the text corresponding to the second voice signal, and the voiceprint management device then obtains the text corresponding to the second voice signal from the semantic understanding device.
  • The manner of obtaining the text corresponding to the second voice signal is not limited in the present application.
  • In the above technical solution, the reference voiceprint of the first user is determined in advance; second voiceprints of the first user are then obtained in the second time window, the similarity between each second voiceprint and the reference voiceprint is determined, and the average similarity in the second time window is determined from these multiple similarities.
  • Since the second time window lies after the time point when the reference voiceprint is stored, the user's voiceprint is only changing steadily during the period from registration of the reference voiceprint to the end point of the second time window. The steadily-changing voiceprint of the first user can therefore be used to obtain a reference parameter (that is, the average similarity of the second time window with respect to the reference voiceprint).
  • Further, the similarity between the first voiceprint and the reference voiceprint can be determined based on the user's first voiceprint in the first time window, and compared with the average similarity of the second time window to determine the verification result of the first voiceprint; that is, to determine whether the first voiceprints in the first time window resemble the second voiceprints in the second time window, and hence whether the user's voiceprint has undergone a long-term change during the period from the start point of the second time window to the end point of the first time window.
  • the verification results of multiple first voiceprints in the first time window are determined, and the ratio of the first voiceprints that pass verification in the first time window (that is, the compliance rate in the first time window) is determined.
  • Whether the user's voiceprint has undergone a long-term change is determined according to the compliance rate in the first time window, which avoids misjudgments caused by incidental factors affecting the user within the first time window and helps improve the accuracy of the judgment.
  • Furthermore, the first time window can be slid, and the ratio of verified first voiceprints in the slid first time window (that is, the compliance rate in the slid first time window) can be determined. Whether the user's voiceprint has undergone a permanent change is determined according to that compliance rate, which helps to discover permanent changes in the first user's voiceprint in a timely manner and to update the first user's registered voiceprint.
  • The second voice signal is acquired, and the voiceprint of the first user is determined from it, without interacting with the user; the process is thus imperceptible to the user, which helps to improve the user experience.
  • There is no need to pre-store multiple first voiceprints in the first time window, which helps to reduce the amount of storage.
  • Likewise, there is no need to store multiple second voiceprints in the second time window; only the average similarity of the second time window needs to be saved, further reducing the storage requirements.
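The storage saving can be made concrete with a running-mean update: only the current average and a count survive between voiceprints, never the voiceprints themselves. This is a sketch of the storage idea, not the patented procedure:

```python
def update_running_average(avg: float, count: int, new_similarity: float):
    """Incorporate one new similarity into the stored average; only the
    pair (avg, count) needs to be kept in storage."""
    count += 1
    avg += (new_similarity - avg) / count
    return avg, count

avg, n = 0.0, 0
for s in (0.5, 0.7, 0.6):
    avg, n = update_running_average(avg, n, s)
print(round(avg, 6), n)  # mean of the three similarities, and their count
```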
  • the methods and operations implemented by the voiceprint management device may also be implemented by components (such as chips or circuits) that can be used in the voiceprint management device.
  • each functional module in each embodiment of the present application may be integrated into one processor, or physically exist separately, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • Fig. 11 and Fig. 12 are schematic structural diagrams of possible voiceprint management devices exemplarily provided by the present application. These voiceprint management devices can be used to realize the functions of the above-mentioned method embodiments, and therefore can also realize the beneficial effects possessed by the above-mentioned method embodiments.
  • the voiceprint management device includes: an acquisition module 1101 and a processing module 1102 .
  • The acquisition module 1101 can be used to perform the acquisition functions of the voiceprint management device in the relevant method embodiments of FIG. 6 or FIG. 8, for example, the steps of acquiring voice signals.
  • The processing module 1102 can be used to execute the processing functions of the voiceprint management device in the relevant method embodiments of FIG. 6 or FIG. 8, for example, the voiceprint judgment steps or the steps of updating the reference voiceprint.
  • The processing module 1102 is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  • The verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window; specifically, when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that it passes verification.
  • the processing module 1102 is specifically configured to: update the reference voiceprint according to the ratio of the verified first voiceprint in the first time window.
  • the processing module 1102 is specifically configured to: update the reference voiceprint when the ratio of the verified first voiceprint in the first time window is less than or equal to a ratio threshold.
  • Optionally, the processing module 1102 is specifically configured to: control the acquisition module 1101 to acquire the second voice signal of the first user, wherein the difference between the similarity between the second voice signal and the reference voiceprint and the average similarity of the first time window is less than the difference threshold, the average similarity of the first time window being used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and update the reference voiceprint according to the second voice signal.
  • Optionally, the processing module 1102 is specifically configured to: obtain the deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, wherein the text corresponding to the deduplicated voice signal is non-repetitive text; and update the reference voiceprint according to the deduplicated voice signal.
  • Optionally, the processing module 1102 is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain the deduplicated text; and obtain, according to the deduplicated text, the corresponding speech signal within the second voice signal, so as to obtain the deduplicated speech signal.
  • Optionally, the processing module 1102 is specifically configured to: for the i-th second voice signal among the multiple second voice signals, perform a deduplication operation on the text corresponding to the i-th second voice signal; splice the text of the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain the deduplicated text corresponding to the 1st to i-th second voice signals, wherein the historical deduplicated text is obtained according to the texts corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtain, according to the resulting deduplicated text, the corresponding voice signals among the plurality of second voice signals, so as to obtain the deduplicated voice signal.
  • Optionally, the duration of the deduplicated speech signal is greater than the duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated speech signal is greater than the word-count threshold.
  • the processing module 1102 is further configured to: slide the first time window, wherein an end point of the first time window after sliding is later than an end point of the first time window before sliding.
  • the length of the first time window is variable.
  • Optionally, the processing module 1102 is further configured to: slide the second time window, wherein the end point of the slid second time window is later than the end point of the second time window before sliding, and the start point of the slid second time window is earlier than the start point of the slid first time window.
  • Optionally, the processing module 1102 is further configured to: determine the similarities of the second voiceprints in the second time window; and determine the average similarity in the second time window according to the multiple similarities in the second time window.
  • FIG. 12 shows another voiceprint management device provided by an embodiment of the present application. The device shown in FIG. 12 may be a hardware-circuit implementation of the device shown in FIG. 11.
  • the apparatus may be applicable to the flow chart shown above to execute the above-mentioned method embodiment.
  • For ease of illustration, only the main components of the device are shown in FIG. 12.
  • the voiceprint management device includes: a processor 1210 and an interface 1230 , and optionally, the voiceprint management device further includes a memory 1220 .
  • the interface 1230 is used to implement communication with other devices.
  • the methods performed by the voiceprint management device in the above embodiments can be implemented by calling the program stored in the memory (which may be the memory 1220 in the voiceprint management device or an external memory) by the processor 1210. That is, the voiceprint management device may include a processor 1210, and the processor 1210 executes the method performed by the voiceprint management device in the above method embodiments by calling a program in the memory.
  • the processor here may be an integrated circuit with signal processing capabilities, such as a CPU.
  • the voiceprint management device can be realized by one or more integrated circuits configured to implement the above method. For example: one or more ASICs, or one or more microprocessors DSP, or one or more FPGAs, etc., or a combination of at least two of these integrated circuit forms. Alternatively, the above implementation manners may be combined.
  • the functions/implementation process of the processing module 1102 and the obtaining module 1101 in FIG. 11 can be realized by calling the computer-executed instructions stored in the memory 1220 by the processor 1210 in the voiceprint management device shown in FIG. 12 .
  • the present application provides a computer program product, the computer program product includes computer programs or instructions, and when the computer program or instructions are executed by a computing device, the methods in the above method embodiments are implemented.
  • the present application provides a computer-readable storage medium, in which computer programs or instructions are stored, and when the computer programs or instructions are executed by a computing device, the methods in the above-mentioned method embodiments are implemented .
  • the present application provides a computing device, including a processor, the processor is connected to a memory, the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the computing device performs the above method Methods in the Examples.
  • an embodiment of the present application provides a chip system, including: a processor and a memory, the processor is coupled to the memory, and the memory is used to store programs or instructions, and when the programs or instructions are executed by the processor, the The chip system implements the methods in the foregoing method embodiments.
  • the chip system further includes an interface circuit for exchanging code instructions to the processor.
  • There may be one or more processors in the chip system, and the processors may be implemented by hardware or by software.
  • the processor may be a logic circuit, an integrated circuit, or the like.
  • the processor may be a general-purpose processor implemented by reading software codes stored in a memory.
  • the memory can be integrated with the processor, or can be set separately from the processor.
  • The memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated with the processor on the same chip, or may be provided on different chips.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that realize the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.


Abstract

A voiceprint management method and apparatus, which are used to timely and accurately update a reference voiceprint in a voiceprint management apparatus, thereby improving recognition accuracy in voiceprint recognition. The method comprises: acquiring a verification result of a first voiceprint in a first time window, wherein the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint and the average similarity of a second time window, the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window, the average similarity of the second time window is used to indicate the similarity between a voiceprint of the first user acquired in the second time window and the reference voiceprint, and the starting point of the first time window is later than the starting point of the second time window; and updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.

Description

A voiceprint management method and apparatus

Technical Field

The present application relates to the technical field of speech recognition, and in particular to a voiceprint management method and apparatus.

Background Art

With the iteration of technology and the improvement of legal regulations, voiceprint recognition technology has come into wide use in many scenarios, such as Internet-of-Vehicles scenarios, smart home scenarios, and business handling scenarios. Voiceprint recognition compares a voiceprint already stored in a voiceprint management device with a newly collected voiceprint to determine the identity of a user.

A reference voiceprint of the user (that is, the voiceprint the user registered with the voiceprint management device) is pre-stored in the voiceprint management device. During voiceprint recognition, when the voiceprint management device acquires a collected voiceprint of the user, it can compare the collected voiceprint with the reference voiceprint and thereby determine whether the two correspond to the same user.

However, a user's voiceprint may change over time. To ensure the accuracy of voiceprint recognition, the reference voiceprint in the voiceprint management device needs to be updated in a timely manner.
Summary of the Invention

The present application provides a voiceprint management method and apparatus, used to update the reference voiceprint in a voiceprint management device in a timely and accurate manner, thereby improving recognition accuracy in voiceprint recognition.

In a first aspect, the present application provides a voiceprint management method. The method may be executed by a terminal device, for example an in-vehicle device (such as a head unit or in-vehicle computer) or a user terminal (such as a mobile phone or computer). The voiceprint management method may also be implemented by a component of the terminal device, such as a processing apparatus, circuit, or chip in the terminal device, for example a system chip or processing chip in the terminal device. The system chip is also called a system on chip (SoC) chip.

The method may also be executed by a server. The server may include a physical device such as a host or a processor, may include a virtual device such as a virtual machine or a container, and may also include a chip or an integrated circuit. The server is, for example, an Internet-of-Vehicles (IoV) server, also called a cloud server, cloud, cloud side, cloud-side server, or cloud-side controller. The IoV server may be a single server or a server cluster composed of multiple servers; this is not specifically limited. In addition, the method may also be executed by a component of the server, for example implemented by a processing apparatus, circuit, chip, or other component in the server.
In a possible implementation, the voiceprint management method provided by the present application includes: acquiring a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and the average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate the similarity between the voiceprint of the first user acquired in the second time window and the reference voiceprint; and the starting point of the first time window is later than the starting point of the second time window; and updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.
It should be understood that, in the above technical solution, the average similarity of the second time window is determined in advance, and this average similarity together with the reference voiceprint serves as the baseline for deciding whether to update the reference voiceprint. This avoids inaccurate decisions caused by a single inaccurate voiceprint registered by the first user. Further, the similarities between the multiple first voiceprints in the first time window and the reference voiceprint are determined and, combined with the average similarity of the second time window, used to determine whether the first user's voiceprint has changed permanently. If the first user's voiceprint has changed permanently, the first user's reference voiceprint is updated; if it has not changed permanently (for example, the change is only short-term), the reference voiceprint is not updated. This helps improve the accuracy of the first user's reference voiceprint, thereby improving recognition accuracy and robustness in voiceprint recognition.
In a possible implementation, the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window includes: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that it passes verification; and/or, when the similarity between the first voiceprint and the reference voiceprint is less than or equal to the average similarity of the second time window, the verification result of the first voiceprint is that it fails verification.

It should be understood that, in the above technical solution, using the reference voiceprint together with the average similarity of the second time window as the baseline for judging whether the first voiceprint passes verification avoids the problem of an inaccurate registered voiceprint caused by chance at registration time, and thus helps improve the accuracy of the verification result of the first voiceprint. Further, only the verification results of the first voiceprints in the first time window (passed or failed) need to be stored, rather than the multiple first voiceprints themselves, which helps reduce the amount of storage.
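The pass/fail rule above can be sketched as follows. Cosine similarity over voiceprint embeddings is an illustrative assumption, not part of the application, which fixes no particular similarity measure; all names are hypothetical:

```python
import math

def cosine_similarity(a, b):
    # Similarity between two voiceprint embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify_first_voiceprint(first_vp, reference_vp, second_window_avg):
    # Passes iff the captured first voiceprint is more similar to the
    # reference than the historical average observed in the second window.
    return cosine_similarity(first_vp, reference_vp) > second_window_avg
```

Only the boolean outcome needs to be retained per utterance, which is what makes the storage saving noted above possible.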
In a possible implementation, updating the reference voiceprint according to the verification result of the first voiceprint in the first time window includes: updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window (that is, the pass rate of the first time window), where this ratio is the number of first voiceprints that pass verification in the first time window divided by the total number of first voiceprints in the first time window. In a possible implementation, updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window includes: updating the reference voiceprint when this ratio is less than or equal to a ratio threshold.

It should be understood that, in the above technical solution, the verification results of multiple first voiceprints can be obtained in the first time window, and the ratio of first voiceprints that pass verification can then be determined from those results. This ratio can indicate whether the first user's voiceprint has changed permanently, so the first user's reference voiceprint is updated, or left unchanged, according to it. This avoids mistakenly updating the reference voiceprint because of a single, possibly accidental, verification result, and helps identify more accurately whether the first user's voiceprint has changed permanently, so that the reference voiceprint is updated when the voiceprint has changed permanently and the original reference voiceprint is kept when it has not.
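The pass-rate decision can be sketched concretely; the ratio threshold value here is hypothetical, since the application does not specify one:

```python
def should_update_reference(results, ratio_threshold=0.5):
    # results: verification outcomes (True = passed) collected for the
    # first voiceprints in the first time window.
    if not results:
        return False  # no evidence in the window, keep the reference
    pass_ratio = sum(results) / len(results)
    # A low pass rate suggests the user's voiceprint has drifted
    # permanently away from the stored reference voiceprint.
    return pass_ratio <= ratio_threshold
```

Basing the decision on the whole window's ratio, rather than one result, is what guards against the accidental single-result update described above.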
In a possible implementation, updating the reference voiceprint includes: acquiring a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and updating the reference voiceprint according to the second voice signal.

It should be understood that, in the above technical solution, the similarity between the acquired second voice signal and the reference voiceprint meeting the preset condition can be understood as follows: during the period from the starting point of the first time window to the moment the second voice signal is acquired, the change in the first user's voiceprint is smaller than a change threshold (in other words, the first user's voiceprint is in a stable state, or has not changed permanently again), so the reference voiceprint can be updated according to this second voice signal.
In a possible implementation, updating the reference voiceprint according to the second voice signal includes: obtaining a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive text; and updating the reference voiceprint according to the deduplicated voice signal.

In a possible implementation, there is one second voice signal, and obtaining the deduplicated voice signal according to the second voice signal and its corresponding text includes: performing a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and obtaining the deduplicated voice signal according to the portion of the second voice signal that corresponds to the deduplicated text.

In a possible implementation, there are multiple second voice signals, and obtaining the deduplicated voice signal according to the second voice signals and their corresponding texts includes: for the i-th of the multiple second voice signals: performing a deduplication operation on the text corresponding to the i-th second voice signal against the historical deduplicated text; and splicing the text corresponding to the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the first through i-th second voice signals, where the historical deduplicated text is obtained from the texts corresponding to the first through (i-1)-th second voice signals, and i is greater than 1; and obtaining the deduplicated voice signal according to the portions of the multiple second voice signals that correspond to their joint deduplicated text.

It should be understood that, in the above technical solution, performing a deduplication operation on the second voice signal according to the signal and its corresponding text yields a deduplicated voice signal, which prevents high-frequency characters or words appearing many times in the user's second voice signal from affecting the accuracy of the extracted reference voiceprint.
In a possible implementation, the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.

It should be understood that, in the above technical solution, when the duration of the deduplicated voice signal is greater than the duration threshold, and/or the number of characters in its non-repetitive text is greater than the character-count threshold, the reference voiceprint is updated according to the deduplicated voice signal. A deduplicated voice signal of sufficient duration and/or covering a sufficient number of characters can thus be obtained, further improving the accuracy of the extracted reference voiceprint.
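A minimal sketch of the incremental deduplication described above, assuming the transcript of each second voice signal can be aligned character-by-character with a list of audio segments; that alignment, the segment representation, and the `min_chars` threshold are all illustrative assumptions, not details fixed by the application:

```python
def dedup_speech(signals, min_chars=0):
    # signals: list of (text, segments) pairs, one per second voice signal,
    # where segments[i] is the audio slice carrying character text[i].
    seen = []            # historical deduplicated text, in first-seen order
    kept_segments = []   # audio slices kept for voiceprint extraction
    for text, segments in signals:
        for ch, seg in zip(text, segments):
            if ch not in seen:   # drop characters already covered
                seen.append(ch)
                kept_segments.append(seg)
    dedup_text = ''.join(seen)
    # Only use the result once it covers enough distinct characters
    # (an analogous check would apply to total audio duration).
    if len(dedup_text) <= min_chars:
        return None
    return dedup_text, kept_segments
```

Processing signals in order reproduces the per-signal splice against the historical deduplicated text: each new signal contributes only the characters (and their audio) not yet seen.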
In a possible implementation, the method further includes: sliding the first time window, where the end point of the first time window after sliding is later than the end point of the first time window before sliding.

It should be understood that, in the above technical solution, the first time window can be slid, the verification results of the first voiceprints in the slid first time window can be acquired, and the ratio of first voiceprints that pass verification in the slid window (that is, the pass rate of the slid first time window) can be determined from those results. The end point of the first time window after sliding is later than the end point before sliding, and the interval between the two is shorter than the length of the first time window. In effect, the pass rate of the first time window is computed dynamically: at every fixed time interval, the pass rate over that time span (that is, the first time window) is determined, so that a permanent change in the first user's voiceprint can be detected in time and the first user's registered voiceprint updated.
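The dynamic recomputation can be sketched as a sliding window over timestamped verification results; the window length and stride are illustrative parameters, since the application fixes neither:

```python
def sliding_pass_ratios(events, window_len, stride):
    # events: (timestamp, passed) pairs sorted by time. Every `stride`
    # time units, report the pass ratio over the trailing window of
    # length `window_len`; windows containing no results are skipped.
    if not events:
        return []
    out = []
    end = events[0][0] + window_len
    while end <= events[-1][0] + stride:
        in_win = [p for t, p in events if end - window_len <= t < end]
        if in_win:
            out.append((end, sum(in_win) / len(in_win)))
        end += stride
    return out
```

Because the stride is shorter than the window, consecutive windows overlap, so a drop in the pass rate surfaces within one stride rather than one full window.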
In a possible implementation, the length of the first time window is variable.

It should be understood that, in the above technical solution, a first count threshold can be set; when the number of verification results of first voiceprints determined in the first time window is less than this threshold, the length of the first time window can be extended automatically until the number of determined verification results reaches the first count threshold. Basing the decision on verification results whose number reaches the first count threshold helps improve the accuracy of detecting a permanent change in the first user's voiceprint.
In a possible implementation, the method further includes: sliding the second time window, where the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.

In a possible implementation, the starting point of the second time window after sliding being earlier than the starting point of the first time window after sliding includes: sliding the second time window when the reference voiceprint is updated, where the starting point of the second time window after sliding is no earlier than the time point at which the reference voiceprint was updated.

In a possible implementation, the starting point of the second time window after sliding being earlier than the starting point of the first time window after sliding includes: sliding the second time window when the reference voiceprint is updated, where the starting point of the second time window after sliding is no earlier than the time point at which the second voice signal was acquired.

It should be understood that, in the above technical solution, when the reference voiceprint is updated, the second time window can be slid, the second voiceprints in the slid second time window acquired, and their similarities to the reference voiceprint determined, to obtain the average similarity of the slid second time window. Updating the average similarity of the second time window promptly after updating the reference voiceprint helps subsequently determine whether the first user's voiceprint undergoes another permanent change.
In a possible implementation, the length of the second time window is variable.

It should be understood that, in the above technical solution, a second count threshold can be set; when the number of similarities between second voiceprints and the reference voiceprint determined in the second time window is less than this threshold, the length of the second time window can be extended automatically until that number reaches the second count threshold. Basing the average on similarities whose number reaches the second count threshold helps improve the accuracy of the average similarity of the second time window.
In a possible implementation, before acquiring the verification result of the first voiceprint in the first time window, the method further includes: determining similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and determining the average similarity of the second time window according to the multiple similarities in the second time window.

It should be understood that, in the above technical solution, only the average similarity of the second time window needs to be stored, rather than the multiple second voiceprints in the second time window, which helps reduce the amount of storage. This average similarity is used to determine the verification result of the first voiceprint in the first time window, and is easy to implement.
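The storage saving noted above can be made concrete with a running mean: only a count and a mean are retained for the second time window, so the individual second voiceprints and their similarities can be discarded as they arrive (the class and member names are illustrative):

```python
class SecondWindowBaseline:
    # Maintains the average similarity of the second time window without
    # storing the individual second voiceprints or their similarities.
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, similarity):
        # Incremental (running) mean update: O(1) memory per window.
        self.count += 1
        self.mean += (similarity - self.mean) / self.count
```

After each update of the reference voiceprint, a fresh instance would be started for the slid second time window.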
In a second aspect, the present application provides a voiceprint management apparatus. The apparatus may be a terminal device, or a component (such as a processing apparatus, circuit, or chip) in a terminal device. The terminal device is, for example, an in-vehicle device (such as a head unit or in-vehicle computer) or a user terminal (such as a mobile phone or computer). The apparatus may also be a server, or a component (such as a processing apparatus, circuit, or chip) in a server. The server may include a physical device such as a host or a processor, may include a virtual device such as a virtual machine or a container, and may also include a chip or an integrated circuit. The server is, for example, an Internet-of-Vehicles (IoV) server, also called a cloud server, cloud, cloud side, cloud-side server, or cloud-side controller; the IoV server may be a single server or a server cluster composed of multiple servers, which is not specifically limited.

In a possible implementation, the voiceprint management apparatus provided by the present application includes an acquisition module and a processing module. The acquisition module is configured to acquire a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and the average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate the similarity between the voiceprint of the first user acquired in the second time window and the reference voiceprint; and the starting point of the first time window is later than the starting point of the second time window. The processing module is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
In a possible implementation, the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window includes: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that it passes verification.

In a possible implementation, the processing module is specifically configured to: update the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window.

In a possible implementation, the processing module is specifically configured to: update the reference voiceprint when the ratio of first voiceprints that pass verification in the first time window is less than or equal to a ratio threshold.

In a possible implementation, the processing module is specifically configured to: control the acquisition module to acquire a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint; and update the reference voiceprint according to the second voice signal.

In a possible implementation, the processing module is specifically configured to: obtain a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive text; and update the reference voiceprint according to the deduplicated voice signal.

In a possible implementation, there is one second voice signal, and the processing module is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and obtain the deduplicated voice signal according to the portion of the second voice signal that corresponds to the deduplicated text.

In a possible implementation, there are multiple second voice signals, and the processing module is specifically configured to: for the i-th of the multiple second voice signals: perform a deduplication operation on the text corresponding to the i-th second voice signal against the historical deduplicated text; and splice the text corresponding to the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the first through i-th second voice signals, where the historical deduplicated text is obtained from the texts corresponding to the first through (i-1)-th second voice signals, and i is greater than 1; and obtain the deduplicated voice signal according to the portions of the multiple second voice signals that correspond to their joint deduplicated text.

In a possible implementation, the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.

In a possible implementation, the processing module is further configured to: slide the first time window, where the end point of the first time window after sliding is later than the end point of the first time window before sliding.

In a possible implementation, the length of the first time window is variable.

In a possible implementation, the processing module is further configured to: slide the second time window, where the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.

In a possible implementation, before the acquisition module acquires the verification result of the first voiceprint in the first time window, the processing module is further configured to: determine similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and determine the average similarity of the second time window according to the multiple similarities in the second time window.
In a third aspect, the present application provides a computer program product. The computer program product includes a computer program or instructions which, when executed by a computing device, implement the method in the above first aspect or any possible implementation of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program or instructions which, when executed by a computing device, implement the method in the above first aspect or any possible implementation of the first aspect.

In a fifth aspect, the present application provides a computing device including a processor. The processor is connected to a memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device executes the method in the above first aspect or any possible implementation of the first aspect.

In a sixth aspect, an embodiment of the present application provides a chip system including a processor and a memory. The processor is coupled to the memory, and the memory is configured to store a program or instructions which, when executed by the processor, cause the chip system to implement the method in the above first aspect or any possible implementation of the first aspect.
Optionally, the chip system further includes an interface circuit, and the interface circuit is configured to exchange code instructions with the processor.
Optionally, there may be one or more processors in the chip system, and a processor may be implemented in hardware or in software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general-purpose processor that operates by reading software code stored in a memory.
Optionally, there may also be one or more memories in the chip system. A memory may be integrated with the processor, or may be disposed separately from the processor. For example, the memory may be a non-transitory memory such as a read-only memory (ROM); it may be integrated with the processor on the same chip, or the two may be disposed on different chips.
It should be understood that, in the technical solutions of the first aspect to the sixth aspect, the reference voiceprint of the first user is determined in advance; a second voiceprint of the first user is then obtained in the second time window, the similarity between the second voiceprint and the reference voiceprint is determined, and the average similarity of the second time window is determined according to a plurality of similarities in the second time window. Because the second time window lies after the time point at which the reference voiceprint is stored (that is, during the period from the time point at which the user registers the reference voiceprint to the end point of the second time window), the user's voiceprint changes only gradually, and a reference parameter (namely, the average similarity of the second time window with respect to the reference voiceprint) can be obtained based on this steadily changing voiceprint of the first user.
In the subsequent process of determining whether the user's voiceprint has changed permanently, the similarity between the first voiceprint and the reference voiceprint can be determined based on the user's first voiceprint in the first time window, and that similarity can be compared with the average similarity of the second time window to determine the verification result of the first voiceprint, that is, to determine whether the first voiceprint in the first time window is similar to the second voiceprints in the second time window, and further to determine whether the user's voiceprint has changed permanently during the period from the start point of the second time window to the end point of the first time window.
Moreover, in the present application, the verification results of a plurality of first voiceprints in the first time window are determined, and the ratio of first voiceprints that pass verification in the first time window (that is, the compliance rate of the first time window) is determined. Whether the user's voiceprint has changed permanently is determined according to the compliance rate of the first time window, which avoids a misjudgment caused by some accidental factor affecting the user within the first time window and helps improve the accuracy of the determination.
Further, when the reference voiceprint is not updated, the first time window can be slid, and the ratio of first voiceprints that pass verification in the slid first time window (that is, the compliance rate of the slid first time window) is determined. Whether the user's voiceprint has changed permanently is determined according to the compliance rate of the slid first time window, which helps detect in time that the voiceprint of the first user has changed permanently and update the registered voiceprint of the first user.
Further, when the reference voiceprint is updated, a second voice signal is obtained, and the voiceprint of the first user is determined according to the obtained second voice signal. No interaction with the user is required, so the update is imperceptible to the user, which helps improve the user experience. Moreover, the plurality of first voiceprints in the first time window does not need to be pre-stored, which helps reduce the amount of storage. In the present application, the plurality of second voiceprints in the second time window does not need to be stored either; only the average similarity of the second time window needs to be saved, which further reduces the amount of storage.
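The decision logic summarized above can be sketched as follows. All names and numeric values are illustrative assumptions: the text specifies comparing per-sample similarities with the second-window average and thresholding the compliance rate, but does not give the exact comparison rule, the tolerance margin, or the rate threshold used here.

```python
def cosine_similarity(a, b):
    # Similarity between two voiceprint feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) *
                  (sum(x * x for x in b) ** 0.5))

def compliance_rate(reference, first_window_voiceprints,
                    second_window_avg, margin=0.05):
    # Fraction of first voiceprints in the first time window whose
    # similarity to the reference voiceprint stays close to the average
    # similarity of the second time window (i.e. passes verification).
    passed = sum(
        1 for v in first_window_voiceprints
        if cosine_similarity(reference, v) >= second_window_avg - margin
    )
    return passed / len(first_window_voiceprints)

def voiceprint_changed_permanently(rate, rate_threshold=0.5):
    # A low compliance rate over the whole window, rather than a single
    # failed sample, is taken as evidence of a permanent change.
    return rate < rate_threshold
```

Deciding on the whole-window rate rather than on any individual sample is what protects against the accidental misjudgments mentioned above.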
Description of drawings
Fig. 1a is a schematic structural diagram of a voice signal processing system exemplarily provided by the present application;
Fig. 1b is a schematic flowchart of a semantic understanding process exemplarily provided by the present application;
Fig. 1c is a schematic flowchart of a voiceprint extraction process exemplarily provided by the present application;
Fig. 2 is a schematic diagram of a vehicle-mounted scenario exemplarily provided by the present application;
Fig. 3 is a schematic diagram of another vehicle-mounted scenario exemplarily provided by the present application;
Fig. 4 is a schematic diagram of a mobile phone interface display exemplarily provided by the present application;
Fig. 5 is a schematic diagram of a correspondence between a voiceprint management procedure and time exemplarily provided by the present application;
Fig. 6 is a schematic flowchart of voiceprint verification exemplarily provided by the present application;
Fig. 7 is a schematic diagram of another correspondence between a voiceprint management procedure and time exemplarily provided by the present application;
Fig. 8 is a schematic flowchart of updating a reference voiceprint exemplarily provided by the present application;
Fig. 9 is a schematic diagram of yet another vehicle-mounted scenario exemplarily provided by the present application;
Fig. 10 is a schematic flowchart of yet another voiceprint management method exemplarily provided by the present application;
Fig. 11 is a schematic structural diagram of a voiceprint management apparatus exemplarily provided by the present application;
Fig. 12 is a schematic structural diagram of another voiceprint management apparatus exemplarily provided by the present application.
Detailed description of embodiments
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1a is a schematic structural diagram of a voice signal processing system exemplarily provided by the present application. The voice signal processing system may include at least a voice collection device, a voiceprint management device, and a semantic understanding device.
For example, the voice signal processing system may collect a user's voice signal through the voice collection device and input the voice signal to the semantic understanding device and the voiceprint management device separately. The semantic understanding device may perform a semantic extraction procedure on the voice signal to obtain a machine-recognizable instruction corresponding to the user's voice signal; the voiceprint management device may determine the user's voiceprint feature vector according to the voice signal and perform a corresponding recognition process based on the voiceprint feature vector.
The voice collection device may be a microphone array, which may consist of a certain number of acoustic sensors (generally microphones). The microphone array may provide one or more of the following functions: speech enhancement, that is, extracting clean speech from a noisy voice signal; sound source localization, that is, using the microphone array to calculate the angle and distance of a target speaker, thereby enabling tracking of the target speaker and subsequent directional voice pickup; de-reverberation, that is, reducing the influence of reflected sound; and sound source signal extraction/separation, that is, extracting each of a plurality of mixed sounds. A microphone array is suitable for complex environments with heavy murmur, noise, and echo, such as in vehicles, outdoors, or in supermarkets.
Referring to the schematic flowchart of a semantic understanding process exemplarily shown in Fig. 1b, the semantic understanding device may perform the following processing on the voice signal in sequence:
(1) Automatic speech recognition (ASR): converting the voice signal input by the user into natural language text. In one possible manner, the voice signal may be processed as a sound wave. Specifically, the voice signal is divided into frames to obtain a short waveform segment corresponding to each frame, where the duration of each frame may be about 20 ms to 30 ms. The short waveform segment of each frame is converted into multi-dimensional vector information according to characteristics of the human ear. The multi-dimensional vector information is decoded to obtain a plurality of corresponding phonemes (phones), and the phonemes are assembled into words and concatenated into sentences (that is, natural language text) for output.
(2) Natural language processing (NLP): converting the meaningful parts of the natural language text into structured information that a machine can understand.
(3) Semantic slot filling: filling the structured information obtained by natural language processing into corresponding slots, so that the user's intention is converted into a machine-recognizable user instruction.
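The framing step in (1) above can be sketched as follows. This is illustrative only: the 25 ms frame length is one value in the 20 ms to 30 ms range mentioned above, and real ASR front ends typically use overlapping frames and window functions, which are omitted here.

```python
def frame_signal(samples, sample_rate, frame_ms=25):
    # Split a waveform into consecutive, non-overlapping frames of
    # roughly 20-30 ms each; each frame would later be converted into
    # multi-dimensional vector information and decoded into phonemes.
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

For example, at a 16 kHz sampling rate, one second of audio yields 40 frames of 400 samples each.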
Referring to the schematic flowchart of a voiceprint extraction process exemplarily shown in Fig. 1c, the voiceprint management device may include a voiceprint extraction module, which may be used to perform voiceprint extraction on a voice signal to obtain a corresponding voiceprint feature vector. For example, the voiceprint extraction module may sequentially perform preprocessing (also called front-end processing), voiceprint feature extraction, and post-processing on the voice signal.
(1) Preprocessing: extracting audio feature information from the voice signal. For example, in the process of extracting the audio feature information, one or more of the following operations may be performed on the voice signal: denoising, voice activity detection (VAD), perceptual linear prediction (PLP), and mel-frequency cepstral coefficients (MFCC) computation.
(2) Voiceprint feature extraction: the audio feature information is input into a voiceprint feature extraction model, and correspondingly, the voiceprint feature extraction model outputs a voiceprint feature vector. The voiceprint feature extraction model includes, but is not limited to, one or more of: a Gaussian mixture model (GMM), a joint factor analysis (JFA) model, an i-vector model, a d-vector model, and an x-vector model. In the present application, a voiceprint feature vector may be referred to simply as a voiceprint.
(3) Post-processing: the voiceprint output by the voiceprint feature extraction model is post-processed to obtain the final voiceprint. For example, the post-processing may include one or more of: linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), and nuisance attribute projection (NAP).
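The three-stage extraction pipeline above can be sketched as composable functions. Each stage body here is a deliberately simplified stand-in, not the real computation: an actual implementation would use VAD/MFCC for preprocessing, a trained GMM or x-vector model for extraction, and LDA/PLDA for post-processing.

```python
def preprocess(signal):
    # Stand-in for denoising/VAD/MFCC: just mean-normalize the samples.
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]

def extract_embedding(features, dim=4):
    # Stand-in for a GMM/i-vector/x-vector model: fold the feature
    # sequence into a fixed-length vector by strided summation.
    vec = [0.0] * dim
    for i, f in enumerate(features):
        vec[i % dim] += f
    return vec

def postprocess(embedding):
    # Stand-in for LDA/PLDA/NAP: length-normalize the embedding.
    norm = sum(x * x for x in embedding) ** 0.5 or 1.0
    return [x / norm for x in embedding]

def voiceprint(signal):
    # Preprocessing -> feature extraction -> post-processing, in the
    # order described for the voiceprint extraction module.
    return postprocess(extract_embedding(preprocess(signal)))
```

The point of the sketch is the staged structure: each stage can be swapped for a real model without changing the pipeline.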
As described above, the voiceprint extraction module can extract the corresponding voiceprint from a voice signal. Like a face, a fingerprint, or an iris, a voiceprint is a type of biometric information, and the identity of the user who produced the voice signal can be determined from it. Compared with face recognition, identifying a user by voiceprint is not limited by facial occlusion; compared with fingerprint recognition, it is not limited by physical contact; it is therefore highly practicable.
The voiceprint management device may further include a voiceprint recognition module, which may be used to recognize a user's identity according to a voiceprint. In a possible implementation, the voiceprint recognition module pre-stores the user's reference voiceprint, and may compare the voiceprint from the voiceprint extraction module (which may be called the collected voiceprint) with the reference voiceprint to determine whether the collected voiceprint and the reference voiceprint correspond to the same user.
In the present application, the voiceprint recognition module may determine, according to a similarity threshold, whether the collected voiceprint and the reference voiceprint correspond to the same user. The voiceprint recognition module may take the similarities between the registered voiceprints and the collected voiceprints of N users as samples and determine the similarity threshold from these samples, where N is a positive integer. In an optional manner, the voiceprint recognition module obtains, for each of the N users, the similarity between that user's registered voiceprint and collected voiceprint, yielding N similarities, and then determines the similarity threshold from the N similarities. For example, the voiceprint recognition module may take the average or the median of the N similarities as the similarity threshold; the similarity threshold may, for instance, be 0.75.
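The threshold derivation just described (mean or median of N sample similarities) can be sketched as follows; the function name and keyword argument are illustrative.

```python
import statistics

def similarity_threshold(sample_similarities, use_median=False):
    # sample_similarities: one registered-vs-collected voiceprint
    # similarity per user, N values in total.
    if use_median:
        return statistics.median(sample_similarities)
    return statistics.fmean(sample_similarities)
```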
In a first example, if the voiceprint recognition module determines that the similarity between the collected voiceprint and the reference voiceprint is greater than the similarity threshold, it may determine that the collected voiceprint and the reference voiceprint correspond to the same user. If it determines that the similarity is less than or equal to the similarity threshold, it may determine that the collected voiceprint and the reference voiceprint correspond to different users. In the present application, a similarity between the collected voiceprint and the reference voiceprint that is greater than the similarity threshold may be understood as the two matching, and a similarity that is less than or equal to the similarity threshold may be understood as the two not matching.
In a second example, the voiceprint recognition module may instead determine that the collected voiceprint and the reference voiceprint correspond to the same user when the similarity between them is greater than or equal to the similarity threshold, and determine that they correspond to different users when the similarity is less than the similarity threshold. Correspondingly, in the present application, a similarity greater than or equal to the similarity threshold may be understood as the collected voiceprint and the reference voiceprint matching, and a similarity less than the similarity threshold may be understood as the two not matching.
For ease of description, the first example is used as an illustration below.
It should be added that, when the similarities between the collected voiceprint and each of a plurality of pre-registered reference voiceprints are all greater than the similarity threshold, the reference voiceprint with the greatest similarity to the collected voiceprint among the plurality of reference voiceprints may be taken as the matching voiceprint of the collected voiceprint. That is, it is determined that, among the plurality of reference voiceprints, the reference voiceprint with the greatest similarity to the collected voiceprint matches the collected voiceprint, while the other reference voiceprints do not match the collected voiceprint.
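Combining the first example with this best-match rule, the matching logic can be sketched as follows. The function names are hypothetical, cosine similarity is an assumed measure, and ties between references are resolved by the first one seen.

```python
def match_reference(collected, references, threshold=0.75):
    # references maps user name -> enrolled reference voiceprint.
    # Per the first example: a reference matches only if its similarity
    # strictly exceeds the threshold; when several exceed it, the single
    # most similar reference wins and all others are treated as non-matches.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    best_user, best_sim = None, threshold
    for user, ref in references.items():
        sim = cosine(collected, ref)
        if sim > best_sim:
            best_user, best_sim = user, sim
    return best_user  # None means no enrolled user matches
```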
Voiceprint recognition may be used in an identity recognition process. For example, the voiceprint recognition module stores reference voiceprints of one or more users; it may compare the collected voiceprint with the stored reference voiceprints, determine from them the reference voiceprint that matches the currently collected voiceprint, and then determine the user's identity according to the determined reference voiceprint. Further, the voiceprint recognition module may store different permissions corresponding to different users; after determining a user's identity according to the collected voiceprint, it may further determine, according to that identity, the permissions corresponding to the user.
Voiceprint recognition may also be used in an identity verification process. For example, the voiceprint recognition module stores reference voiceprints of one or more users who have passed identity verification; it may compare the collected voiceprint with the stored reference voiceprints to determine whether any reference voiceprint matches the collected voiceprint. If so, it may determine that the user corresponding to the currently collected voiceprint passes identity verification; otherwise, it may determine that the user does not pass identity verification.
To help understand this solution, the voiceprint recognition procedure of the present application is explained below with reference to usage scenarios. For example, in one usage scenario the voiceprint management device may be a vehicle-mounted device (for example, a head unit or a vehicle-mounted computer), and the vehicle-mounted device may determine, based on the reference voiceprint, the identity of the user corresponding to the currently collected voiceprint.
In one example, the vehicle-mounted device may determine, based on the reference voiceprint, whether a user passes identity verification. Specifically, suppose the vehicle-mounted device provides the function of querying vehicle violation information only to the vehicle owner. The vehicle-mounted device may store a reference voiceprint a of user A (the vehicle owner) and mark that reference voiceprint a corresponds to the owner. In the scenario shown in (a) of Fig. 2, when user A wants to query vehicle violation information, user A may say "query vehicle violation information". The vehicle-mounted device may extract the voiceprint from the voice signal "query vehicle violation information", determine that the extracted voiceprint matches reference voiceprint a, and, in response to the voice signal, display the current violation information on the display, for example, "April 20, 2021, ran a red light, 6 points deducted". In the scenario shown in (b) of Fig. 2, when user B (not the owner) queries vehicle violation information, the vehicle-mounted device may extract the voiceprint from user B's voice signal "query vehicle violation information", determine that the extracted voiceprint does not match reference voiceprint a, and prompt user B that the query failed, for example, by displaying "query limited to vehicle owner" on the display interface.
The above example can also be understood as follows: after determining a user's identity, the vehicle-mounted device may determine, based on different user identities, the different permissions corresponding to different users. Specifically, the owner has the permission to query vehicle violation information, while a non-owner does not. When the vehicle-mounted device determines that the user is the owner, it provides the user with the function of querying vehicle violation information; when it determines that the user is not the owner, it refuses to provide that function. In another example of determining different permissions according to different user identities, a reference voiceprint corresponding to the driver may be set in the vehicle-mounted device. When the user voiceprint collected by the vehicle-mounted device matches the driver's reference voiceprint, the vehicle-mounted device may determine that the current user is the driver and, correspondingly, grant the current user the permissions corresponding to the driver, for example, permitting control of the vehicle's driving by voice signal.
In yet another example, the vehicle-mounted device provides different recommended content to different users. Specifically, the vehicle-mounted device may store a reference voiceprint a of user A and record that user A likes rock music, and store a reference voiceprint b of user B and record that user B likes light music. In the scenario shown in (a) of Fig. 3, when user A says "turn on the music", the vehicle-mounted device may extract the voiceprint from the voice signal "turn on the music", compare the extracted voiceprint with the stored reference voiceprints a and b, determine that the extracted voiceprint matches reference voiceprint a, and recommend a rock music list on the display interface. In the scenario shown in (b) of Fig. 3, when user B says "turn on the music", the vehicle-mounted device may extract the voiceprint from the voice signal "turn on the music", compare the extracted voiceprint with the stored reference voiceprints a and b, determine that the extracted voiceprint matches reference voiceprint b, and recommend a light music list on the display interface.
Of course, in this example, when the vehicle-mounted device determines that the extracted voiceprint matches reference voiceprint a, it may also directly play rock music; or, when it determines that the extracted voiceprint matches reference voiceprint b, it may directly play light music. Other implementations are also possible, which are not limited in the present application.
In combination with the above determination by the voiceprint management device of different permissions for different user identities, another usage scenario is exemplarily provided, in which the voiceprint management device may be a user terminal, such as a mobile phone. The reference voiceprint of the phone's owner is pre-stored in the mobile phone, and the mobile phone can determine whether the collected voiceprint matches the reference voiceprint, and thereby whether the user corresponding to the currently collected voiceprint is the owner. When the mobile phone is in the locked-screen state, the owner can instruct the phone by voice signal to unlock and perform a corresponding action. For example, in the scenario shown in Fig. 4, when the owner wants to use the photo album application on the phone, the owner may say "open the photo album application" to the locked phone, whose display interface is shown in (a) of Fig. 4. The phone obtains the voice signal "open the photo album application" and extracts the voiceprint; when the phone determines that the extracted voiceprint matches the reference voiceprint, it may, in response to the voice signal, open the photo album application, with the resulting display interface shown in (b) of Fig. 4.
In the above scenarios, the voiceprint management device needs to compare the collected voiceprint against a reasonably accurate reference voiceprint and then perform the corresponding action based on the comparison result (match or no match). That is, the accuracy of the reference voiceprint stored in the voiceprint management device affects the accuracy of voiceprint recognition.
A user's voiceprint may undergo a short-term change or a permanent change. A short-term change is a reversible change in the user's voiceprint caused by a temporary external stimulus, for example, a reversible change caused by the user catching a cold. A permanent change is an irreversible change in the user's voiceprint caused by a physiological change in the user. The voiceprint management device needs to update the user's reference voiceprint based on a user voiceprint that has changed permanently, but not based on a user voiceprint that has changed only for a short time, thereby improving the accuracy of the reference voiceprint and, in turn, the accuracy of voiceprint recognition.
Therefore, it is necessary to distinguish relatively accurately whether the user's voiceprint has changed and whether the change is short-term or permanent, and then to update the user's reference voiceprint based on a user voiceprint that has changed permanently.
This application provides, by way of example, a voiceprint management method that can be executed by a voiceprint management apparatus. Exemplarily, the voiceprint management apparatus may be the voiceprint management device shown in Figure 1a.
Exemplarily, the voiceprint management apparatus may be a terminal device, or a component in a terminal device (such as a processing device, a circuit, or a chip). The terminal device may be, for example, a vehicle-mounted device (such as a head unit or an on-board computer) or a user terminal (such as a mobile phone or a computer).
As another example, the voiceprint management apparatus may be a server, or a component in a server (such as a processing device, a circuit, or a chip). The server may include a physical device such as a host or a processor, a virtual device such as a virtual machine or a container, or a chip or an integrated circuit. The server may be, for example, an Internet-of-Vehicles server, also called a cloud server, the cloud, or a cloud controller; it may be a single server or a server cluster composed of multiple servers, which is not specifically limited.
In the voiceprint management method provided by this application, it can be determined, based on the reference voiceprint and a reference similarity, whether a voiceprint collected in the current time window passes verification, and thereby whether the user's reference voiceprint needs to be updated.
To better explain the embodiments of this application, the three procedures of the voiceprint management method, namely obtaining the reference voiceprint, obtaining the reference similarity, and voiceprint verification, are explained in turn below with reference to the correspondence between the voiceprint management procedures and time shown in Figure 5.
It should be noted in advance that the three procedures below all concern the same user (also called the registrant or speaker): the user's reference voiceprint and reference similarity are obtained, the user's voiceprint is verified, and it is determined whether the user's reference voiceprint needs to be updated. In a specific implementation, that it is the same user can be determined based on technologies such as voiceprint comparison and face recognition.
In voiceprint comparison, a similarity threshold can be used to indicate whether two voiceprints correspond to the same user. That is, regardless of whether a user's voiceprint has changed, the similarity between that user's collected voiceprint and his or her reference voiceprint is greater than the similarity threshold, while the similarity between the voiceprints of different users is less than the similarity threshold. For example, suppose the reference voiceprint registered by user A is reference voiceprint a, the reference voiceprint registered by user B is reference voiceprint b, and the similarity between reference voiceprints a and b is less than the similarity threshold. When user A speaks, the voiceprint determined from the collected voice signal has a similarity to reference voiceprint a that is greater than the similarity threshold, and a similarity to reference voiceprint b that is less than the similarity threshold; it is therefore determined that the collected voiceprint corresponds to user A, not user B.
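The threshold comparison in the user A / user B example above can be sketched minimally as follows. The function name, the two-user setup, and the threshold value 0.7 are illustrative assumptions, not values given in this application:

```python
def match_user(sim_to_a, sim_to_b, threshold=0.7):
    """Attribute a collected voiceprint to the registered user whose
    reference voiceprint it exceeds the similarity threshold against."""
    if sim_to_a > threshold and sim_to_b <= threshold:
        return "A"
    if sim_to_b > threshold and sim_to_a <= threshold:
        return "B"
    return None  # ambiguous or no registered user matched
```

With user A speaking, a similarity of, say, 0.9 to reference voiceprint a and 0.3 to reference voiceprint b yields `"A"`.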
In face recognition, a face image can be used to determine that the user is speaking, in which case the collected voice signal is the user's voice signal, and the voiceprint extracted from that voice signal is the user's voiceprint. Whether the user is speaking can be determined from the user's mouth shape, which opens and/or closes according to a preset pattern; if a voice signal corresponding to that pattern is collected, it can be determined that the user is currently speaking and that the acquired voice signal was issued by the user.
In this application, the user may also be identified in other ways. For example, a user account may be set up: when the user issues a voice signal after logging in through the account, it can be determined that the acquired voice signal and the user who logged in to the account correspond to the same user. The account may also be replaced by the user's fingerprint: when the user issues a voice signal after logging in with a fingerprint, it can be determined that the acquired voice signal and the user who logged in with the fingerprint correspond to the same user.
In this application, one or more of the above techniques, that is, voiceprint comparison, face recognition, and account verification, may also be combined to determine that it is the same user, so as to improve the accuracy of that determination.
For convenience of description, in the following embodiments, this same user is referred to as the first user.
1. Procedure for obtaining the reference voiceprint:
The first user can register a voiceprint at a registration time point (t0 in (a) of Figure 5); the voiceprint registered by the first user may be called the registered voiceprint or the reference voiceprint. In one possible implementation, a passage of preset text is shown on a display interface and the first user reads it aloud; the first user's current voice signal is acquired, a voiceprint is extracted from that voice signal to obtain the first user's reference voiceprint, and the reference voiceprint is stored.
2. Procedure for obtaining the reference similarity:
Since the reference voiceprint is extracted from a single segment of the first user's voice signal, to avoid the reference voiceprint being inaccurate for incidental reasons, this application can further obtain a reference similarity over a second preset duration after the registration time point. The reference similarity and the reference voiceprint can together serve as reference parameters for determining whether the reference voiceprint needs to be updated; for details, see the description of the voiceprint verification procedure below.
In this application, the period of the second preset duration after the registration time point may also be called the reference time period, the reference time window, the second time window, and so on. The second time window can perform a sliding operation under specific circumstances, and its actual duration is variable.
In the specific acquisition process, the first user's voice signal in the second time window (t0-t1 in (a) of Figure 5) can be acquired, and a voiceprint extracted from it; this voiceprint is the first user's voiceprint collected in the second time window (and may be called a second voiceprint). The second voiceprint can be compared with the reference voiceprint; specifically, the similarity between the second voiceprint and the reference voiceprint can be determined according to a similarity algorithm (such as a cosine similarity algorithm or a Euclidean distance algorithm).
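As a minimal sketch of the cosine-similarity option mentioned above, treating each voiceprint as a feature vector (the two-dimensional example vectors are illustrative only):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two voiceprint feature vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

reference_vp = [0.6, 0.8]   # hypothetical reference voiceprint vector
second_vp = [0.8, 0.6]      # hypothetical second voiceprint vector
sim = cosine_similarity(reference_vp, second_vp)
```

In practice the vectors would be embeddings produced by the voiceprint extraction procedure (Figure 1c), typically with hundreds of dimensions.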
Further, multiple second voiceprints can be collected in the second time window, that is, the similarity between each second voiceprint and the reference voiceprint can be determined. In one example, the average of these similarities can be used as the reference similarity corresponding to the second time window (and may be called the average similarity). Of course, in this application, the median of the similarities, or the mean of the maximum and minimum values, or some other statistic, may also be used as the reference similarity corresponding to the second time window. In what follows, the average similarity corresponding to the second time window is used as an example; "average similarity" may of course be replaced by "reference similarity" with the same meaning.
Here, to improve the accuracy of the average similarity of the second time window, a second count threshold can be preset. When it is determined that the number of second voiceprints in the second time window of the second preset duration is less than this second count threshold, the duration of the second time window can be automatically extended until the number of second voiceprints in the second time window reaches the second count threshold.
Exemplarily, suppose the second count threshold is 10, but only 8 second voiceprints are collected within t0-t1 shown in (a) of Figure 5. To improve the accuracy of the average similarity of the second time window, its duration can be automatically extended; if, for example, the 10th second voiceprint is not collected until t1', then the end point of the second time window is determined to be t1', that is, the second time window is t0-t1'. Further, the average similarity of the second time window can be determined based on the similarities between the 10 second voiceprints and the reference voiceprint.
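The averaging-with-extension logic above can be sketched as follows; returning `None` stands in for "keep the window open and collect more voiceprints" (the function name and the default threshold of 10 mirror the example, but the interface is an assumption):

```python
def reference_similarity(similarities, count_threshold=10):
    """Average similarity for the second time window. Returns None while
    fewer voiceprints than the second count threshold have been collected,
    signalling that the window's duration must be extended."""
    if len(similarities) < count_threshold:
        return None  # window must be extended (e.g. t0-t1 -> t0-t1')
    return sum(similarities) / len(similarities)
```

With only 8 similarities the caller extends the window; once 10 are available, the average becomes the reference similarity.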
3. Voiceprint verification procedure:
In this application, multiple voiceprints of the first user can be acquired in a verification time period (which may be called the first time window), and based on the acquired voiceprints it is determined whether the first user's voiceprint has undergone a long-term change, that is, whether the first user's reference voiceprint needs to be updated based on the voiceprint after the long-term change.
In this application, the starting point of the first time window is later than that of the second time window; the first time window may partially overlap the second time window, or not overlap it at all. The first time window can be set to a first preset duration, which may or may not equal the second preset duration. Further, the first time window can perform a sliding operation under specific circumstances, and its actual duration is variable.
Exemplarily, Figure 6 shows a schematic flowchart of voiceprint verification, in which:
Step 601: acquire a first voice signal in the first time window.
The first voice signal of the first user in the first time window (t1-t2 in (a) of Figure 5) can be acquired, and the first user's first voiceprint extracted from it according to the voiceprint extraction procedure shown in Figure 1c.
Step 602: extract the first voiceprint from the first voice signal, and determine the verification result of the first voiceprint.
The first voiceprint can be compared with the reference voiceprint to determine its verification result. Specifically, the similarity between the first voiceprint and the reference voiceprint can be determined: when that similarity is greater than the average similarity corresponding to the second time window, it is recorded that the first voiceprint passed verification; when it is not greater, it is recorded that the first voiceprint failed verification.
In an optional manner, one bit can be used to record whether the corresponding first voiceprint passed verification; for example, when the first voiceprint passed verification, the corresponding bit is set to 1, and when it failed, the corresponding bit is set to 0. With this technical solution, the multiple first voiceprints themselves need not be stored; only their verification results, each occupying one bit, need to be stored, which helps reduce storage space.
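The one-bit-per-result scheme can be sketched by packing the results into an integer bitmap (the packing layout, with bit i holding the i-th result, is an illustrative choice):

```python
def pack_results(similarities, avg_similarity):
    """Pack pass/fail verification results into an integer bitmap,
    one bit per first voiceprint: 1 = passed (similarity above the
    second time window's average similarity), 0 = failed."""
    bitmap = 0
    for i, sim in enumerate(similarities):
        if sim > avg_similarity:
            bitmap |= 1 << i
    return bitmap
```

Only the bitmap is stored, not the voiceprints: five results fit in five bits rather than five stored feature vectors.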
Step 603: determine, according to the verification results of the first voiceprints in the first time window, whether the reference voiceprint needs to be updated. If no update is required, the voiceprint verification procedure can be executed again. If an update is required, the voiceprint update procedure can be executed.
Multiple first voiceprints can be collected in the first time window, that is, the verification results of multiple first voiceprints can be determined. Based on these verification results, it can be determined whether to update the first user's reference voiceprint.
In an optional example, the ratio of first voiceprints that passed verification (which may be called the pass rate) can be computed, that is, the ratio of the number of first voiceprints that passed verification to the total number of first voiceprints. If this ratio is greater than a ratio threshold, it indicates that the first user's voiceprint has not undergone a long-term change, and the reference voiceprint does not need to be updated. If the ratio is less than or equal to the ratio threshold, it indicates that the first user's voiceprint has undergone a long-term change, and the reference voiceprint needs to be updated.
For example, suppose the average similarity corresponding to the second time window is 0.85 and the ratio threshold is 70%. If a total of 5 first voiceprints are acquired in the first time window, and the similarities between these 5 first voiceprints and the reference voiceprint are, say, 0.90, 0.90, 0.90, 0.80, and 0.86, then the verification results of the 5 first voiceprints are 1, 1, 1, 0, and 1 respectively. The ratio of first voiceprints that passed verification in the first time window is therefore 80%, which is greater than the 70% ratio threshold, so the first user's reference voiceprint does not need to be updated.
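The pass-rate decision in this example can be sketched as follows (the default ratio threshold of 0.70 matches the example; the function interface is an assumption):

```python
def needs_update(results, ratio_threshold=0.70):
    """Decide from the verification results (1 = passed, 0 = failed) of
    the first time window whether the reference voiceprint needs updating:
    a pass rate at or below the ratio threshold indicates a long-term
    change in the first user's voiceprint."""
    pass_rate = sum(results) / len(results)
    return pass_rate <= ratio_threshold
```

For the results 1, 1, 1, 0, 1 above the pass rate is 80%, so no update is triggered.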
Here, to improve verification accuracy, a first count threshold can be preset. When it is determined that the number of verification results of first voiceprints in the first time window of the first preset duration is less than this first count threshold, the duration of the first time window can be automatically extended until the number of verification results reaches the first count threshold. Exemplarily, suppose the first count threshold is 10, but only 8 first voiceprints are collected within t1-t2 shown in (a) of Figure 5; the duration of the first time window can then be automatically extended. If, for example, the 10th first voiceprint is not collected until t2', then the end point of the first time window is determined to be t2', that is, the first time window is t1-t2'. Further, whether to update the reference voiceprint can be determined based on the verification results of the 10 first voiceprints.
In this step, the average of the similarities between the multiple first voiceprints in the first time window and the reference voiceprint (which may be called the average similarity of the first time window) can also be determined. The average similarity of the first time window indicates how similar the first user's voiceprints acquired in the first time window are to the reference voiceprint, and can be used in the subsequent update of the reference voiceprint; for details, see the embodiments below.
Step 604: slide the first time window.
In an optional manner, the first time window can be slid backward by a third preset duration, which may be shorter than the first preset duration or shorter than the second preset duration. The end point of the first time window after sliding is later than its end point before sliding. Exemplarily, referring to (a) and (b) of Figure 5, the first time window slides from t1-t2 to t3-t4, where the interval between t2 and t4 is the third preset duration. In one specific implementation, the first preset duration may be 7 days, the second preset duration 7 days, and the third preset duration 1 day.
Since, in some embodiments, the first time window may be extended until the number of verification results of first voiceprints in it reaches the first count threshold, that is, the duration of the first time window is variable, the duration of the first time window before sliding may be greater than the first preset duration. In that case, the end point of the first time window can be slid backward by the third preset duration, and the starting point of the slid first time window then determined from its new end point and the first preset duration.
For example, with reference to Figure 7: suppose that within the period t1-t2 (where t1 and t2 are separated by the first preset duration), the number of verification results of first voiceprints in the first time window does not reach the first count threshold, so the end point of the first time window is moved back to t2' (where the interval between t1 and t2' is greater than the first preset duration). When sliding the first time window, the time point t2' + the third preset duration can be taken as the end point of the slid window, denoted t4'. The starting point of the slid window is then determined to be t4' - the first preset duration, denoted t3'.
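The sliding arithmetic just described can be sketched in a few lines; the day-based numbers below (an extended window ending on day 9, a 7-day first preset duration, a 1-day third preset duration) are illustrative:

```python
def slide_first_window(current_end, first_preset, third_preset):
    """Slide the first time window as in Figure 7:
    new end point   = old end point + third preset duration   (t4')
    new start point = new end point - first preset duration   (t3')."""
    new_end = current_end + third_preset
    new_start = new_end - first_preset
    return new_start, new_end
```

If the extended window ends at day 9 (t2'), sliding with a 1-day step and a 7-day window gives a new window spanning days 3 to 10.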
Further, first voiceprints of the first user can continue to be collected in the slid first time window; the similarity between each collected first voiceprint and the reference voiceprint is determined, and from it the voiceprint's verification result. In the slid first time window, the verification results of multiple first voiceprints can be determined, and from them whether the first user's reference voiceprint needs to be updated. That is, after step 604, steps 601 to 603 above can continue to be executed until the first user's reference voiceprint is updated.
Exemplarily, referring to (a) to (c) of Figure 5, the first time window may slide gradually from t1-t2 to t5-t6. Further, when the first time window has slid to t5-t6, it is determined from the verification results of the multiple first voiceprints acquired in t5-t6 that the voiceprint update procedure needs to be executed; that is, the voiceprint update procedure of steps 605 to 606 can be started at t6.
Step 605: acquire a second voice signal, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold.
After the voiceprint update procedure is started, the first user's voice signal can be collected and it can be determined whether this voice signal meets a preset condition. If it does, the voice signal is used to update the reference voiceprint; otherwise, it is discarded.
Exemplarily, after the voiceprint update procedure is started, a voiceprint is determined from the collected voice signal, and the similarity between this voiceprint and the reference voiceprint is determined. If the difference between this similarity and the average similarity of the first time window is less than the difference threshold, the collected voice signal (that is, the second voice signal) can be used to update the reference voiceprint; if not, the collected voice signal can be discarded. Exemplarily, the difference threshold is 0.1.
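The acceptance test above can be sketched as follows, taking the difference as an absolute difference (an interpretation, since the text does not specify the sign) and the 0.1 threshold from the example:

```python
def usable_for_update(candidate_similarity, window_avg_similarity,
                      diff_threshold=0.1):
    """Keep a collected voice signal for updating the reference voiceprint
    only if its similarity to the reference voiceprint is close to the
    first time window's average similarity; otherwise it is discarded."""
    return abs(candidate_similarity - window_avg_similarity) < diff_threshold
```

With a window average of 0.80, a candidate at 0.78 is kept while one at 0.60 is discarded.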
Step 606: update the reference voiceprint according to the second voice signal.
In this application, one or more second voice signals may be acquired after the voiceprint update procedure is started. A deduplication operation, or a deduplication operation plus a splicing operation, can be performed on the one or more second voice signals to obtain the resulting voice signal, from which a voiceprint is then extracted.
In this application, it is considered that when the first user issues a voice signal, high-frequency words such as wake-up words may appear in it, and excessive occurrences of such words may affect the accuracy of the voiceprint extracted from the voice signal. Therefore, a deduplication operation needs to be performed on the second voice signal. Specifically, ASR can be performed on the second voice signal to obtain its corresponding text, and a deduplication operation performed on that text to obtain text containing no repeated text (which may be called deduplicated text). Further, based on the deduplicated text and the second voice signal, the voice signal after the deduplication operation (which may be called the deduplicated voice signal) is obtained.
Further, to improve the accuracy of the updated voiceprint, an update condition can be preset; when the deduplicated voice signal meets this condition, the reference voiceprint can be updated from it. Exemplarily, the update condition may be that the duration of the deduplicated voice signal is greater than a duration threshold, and/or that the number of characters in the non-repeated text corresponding to the deduplicated voice signal (also called the deduplicated text) is greater than a character-count threshold. For convenience of description, the following examples all use the update condition that the duration of the deduplicated voice signal is greater than the duration threshold and the number of characters in its non-repeated text is greater than the character-count threshold.
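The conjunctive update condition used in the examples can be sketched as follows; the actual threshold values (30 seconds, 50 characters) are illustrative assumptions, as the application does not fix them:

```python
def meets_update_condition(dedup_duration_s, dedup_char_count,
                           duration_threshold_s=30.0, char_threshold=50):
    """Update condition used in the examples: the deduplicated voice
    signal must be long enough AND its deduplicated text must contain
    enough characters before the reference voiceprint is updated."""
    return (dedup_duration_s > duration_threshold_s
            and dedup_char_count > char_threshold)
```

A deduplicated signal of 45 seconds with 80 characters would trigger the update; one with only 20 characters would not, and more second voice signals would be collected.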
Exemplarily, as shown in (d) of Figure 5, the procedure for updating the reference voiceprint is started at t6; one or more second voice signals are then collected until, at t7, a deduplicated voice signal meeting the update condition is obtained, and the reference voiceprint is updated from it at t7. The period t6-t7 can be understood as a third time window, used to acquire the one or more second voice signals.
Steps 605 and 606 above are explained below with reference to the schematic flowchart of updating the reference voiceprint shown by way of example in Figure 8.
Step 801: acquire the 1st second voice signal.
Step 802: determine, from the 1st second voice signal, its corresponding deduplicated voice signal.
Specifically, ASR is performed on the 1st second voice signal to obtain its corresponding text, and a deduplication operation is performed on that text to obtain the deduplicated text corresponding to the 1st second voice signal (which may be called the first deduplicated text).
Exemplarily, suppose the first user wants to wake up a device whose wake-up word is "小A"; the first user can then issue the second voice signal "小A小A". After this second voice signal is collected, ASR processing is performed on it to obtain its corresponding text, "小A小A". Further, once it is determined that "小A小A" contains the repeated text "小A", the deduplication operation can be performed, and the resulting first deduplicated text is "小A". Similarly, if the first user issues the second voice signal "小A, 小A, please turn on the air conditioner", the first deduplicated text obtained from it is "小A, please turn on the air conditioner".
Of course, the text corresponding to the second voice signal may contain no repeated text; for example, the first user may issue the second voice signal "Hello, 小A", or "小A, please turn on the air conditioner", or "小A, please play some music". In that case, after ASR has been performed on the second voice signal and it is determined that its corresponding text contains no repeated text, that text is used directly as the first deduplicated text.
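The text deduplication step can be sketched as follows, assuming the ASR output has already been segmented into phrases (the segmentation and the ASCII stand-in "xiao A" for the wake-up word are assumptions for illustration):

```python
def dedup_phrases(phrases):
    """Keep the first occurrence of each phrase in the ASR transcript;
    later repeats (e.g. a wake-up word said twice) are dropped."""
    seen = set()
    kept = []
    for phrase in phrases:
        if phrase not in seen:
            seen.add(phrase)
            kept.append(phrase)
    return kept
```

For the transcript "xiao A / xiao A / please turn on the air conditioner" this yields the deduplicated text "xiao A / please turn on the air conditioner"; a transcript with no repeats passes through unchanged.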
Further, the voice signal corresponding to the first deduplicated text within the 1st second voice signal can be determined, so as to obtain the voice signal resulting from the deduplication operation on the 1st second voice signal (that is, the first deduplicated voice signal).
Taking the above 1st second voice signal "小A, 小A, please turn on the air conditioner" as an example, suppose the correspondence between this text and the voice signal segments of the 1st second voice signal is as shown in Table 1.
Table 1

Text                                 | Speech signal segment
"Little A" (first)                   | Speech signal segment 11
"Little A" (second)                  | Speech signal segment 12
"please turn on the air conditioner" | Speech signal segment 13
After the first deduplicated text is determined to be "Little A, please turn on the air conditioner", the first deduplicated speech signal can be determined from the speech signal segments that the first deduplicated text corresponds to in the 1st second voice signal. With reference to Table 1, speech signal segment 11, corresponding to "Little A" (the first), is spliced with speech signal segment 13, corresponding to "please turn on the air conditioner", to obtain the first deduplicated speech signal.
It should be noted that, in this example, since the speech signal segment corresponding to "Little A" in the first deduplicated text may be either speech signal segment 11 or speech signal segment 12, the first deduplicated speech signal may be obtained either by splicing segment 11 with segment 13 or by splicing segment 12 with segment 13.
It should also be noted that the above merely illustrates how the deduplicated speech signal is obtained from the second voice signal according to the deduplicated text. When the second voice signal is divided according to text, the division may also be as shown in Table 2. Of course, the second voice signal may also be divided in other ways, which are not limited in this application.
Table 2

Text                  | Speech signal segment
"Little A" (first)    | Speech signal segment 11
"Little A" (second)   | Speech signal segment 12
"please turn on"      | Speech signal segment 14
"the air conditioner" | Speech signal segment 15
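The segment selection and splicing step of Table 1 can be sketched as follows, modelling each segment by its identifier. Keeping the first of the two "小A" segments is one of the two choices the text allows; keeping the second would be equally valid.

```python
def splice_deduplicated(pieces):
    # pieces: ordered (text, segment_id) pairs from the ASR alignment.
    # Keep one segment per distinct text piece and splice them in order.
    seen, kept = set(), []
    for text, segment_id in pieces:
        if text in seen:
            continue          # drop the duplicate piece, e.g. the second "小A"
        seen.add(text)
        kept.append(segment_id)
    return kept

table1 = [("小A", 11), ("小A", 12), ("请打开空调", 13)]
print(splice_deduplicated(table1))   # [11, 13]
```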
Step 803: if it is determined that the first deduplicated speech signal meets the update condition, perform step 807; if it is determined that the first deduplicated speech signal does not meet the update condition, perform step 804.
Step 804: acquire the 2nd second voice signal.
Step 805: according to the 2nd second voice signal and the first deduplicated text, determine the deduplicated speech signal jointly corresponding to the 1st and 2nd second voice signals.
It should be noted that, since the 1st second voice signal was acquired before the 2nd second voice signal, and the first deduplicated text was obtained by processing the 1st second voice signal, the first deduplicated text can serve as the historical deduplicated text for the 2nd second voice signal. After ASR processing is performed on the 2nd second voice signal to obtain its corresponding text, a deduplication operation is performed on that text according to the historical deduplicated text (the first deduplicated text), yielding the deduplicated text jointly corresponding to the 1st and 2nd second voice signals (which may be called the second deduplicated text).
In one optional manner, a deduplication operation is first performed on the text corresponding to the 2nd second voice signal; the result is then compared with the first deduplicated text, the portions that duplicate the first deduplicated text are deleted, and the remaining text is spliced onto the first deduplicated text to obtain the second deduplicated text.
For example, suppose the first deduplicated text is "Little A, please turn on the air conditioner" and the 2nd second voice signal is "Little A, Little A, turn the air conditioner up". Performing ASR on the 2nd second voice signal yields the text "Little A Little A turn the air conditioner up"; after the deduplication operation, the text "Little A turn the air conditioner up" is obtained. Then, from the first deduplicated text "Little A, please turn on the air conditioner" and "Little A turn the air conditioner up", the second deduplicated text "Little A, please turn the air conditioner up" is obtained.
In another optional manner, the text corresponding to the 2nd second voice signal is first spliced onto the first deduplicated text, and the deduplication operation is then performed on the spliced text to obtain the second deduplicated text.
For example, with the same first deduplicated text and 2nd second voice signal, splicing the text "Little A Little A turn the air conditioner up" onto "Little A, please turn on the air conditioner" gives "Little A please turn on the air conditioner Little A Little A turn the air conditioner up"; performing the deduplication operation then yields "Little A, please turn the air conditioner up".
Of course, there may be other ways of performing the deduplication and splicing operations, which this application does not enumerate one by one.
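The deduplicate-and-splice step against the historical deduplicated text can be sketched with a greedy character-level merge. The patent does not fix the matching granularity, so the algorithm below is one possible realisation, chosen because it reproduces the example in this section (history "小A请打开空调" plus new utterance "小A空调开高点" yields "小A请打开空调高点").

```python
def merge_dedup(history: str, new_text: str) -> str:
    # Scan new_text, drop each longest character run already present
    # in history, and append whatever remains onto history.
    # Illustrative assumption, not the patent's specified algorithm.
    i, kept = 0, []
    while i < len(new_text):
        j = i
        while j < len(new_text) and new_text[i:j + 1] in history:
            j += 1
        if j > i:
            i = j                     # duplicate run: drop it
        else:
            kept.append(new_text[i])  # novel character: keep it
            i += 1
    return history + "".join(kept)

print(merge_dedup("小A请打开空调", "小A空调开高点"))   # 小A请打开空调高点
```

A new utterance entirely covered by the history, such as a repeated "你好小A", leaves the history unchanged.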
After the second deduplicated text is obtained, the speech signals corresponding to the second deduplicated text in the 1st second voice signal and in the 2nd second voice signal can be determined, so as to obtain the speech signal resulting from the deduplication and splicing operations on the 1st and 2nd second voice signals (that is, the second deduplicated speech signal).
Take the 1st second voice signal being "Little A, Little A, please turn on the air conditioner" and the 2nd second voice signal being "Little A, Little A, turn the air conditioner up" as an example. Assume that the correspondence between "Little A, Little A, please turn on the air conditioner" and the speech signal segments in the 1st second voice signal is as shown in Table 1, and that the correspondence between "Little A, Little A, turn the air conditioner up" and the speech signal segments in the 2nd second voice signal is as shown in Table 3.
Table 3

Text                       | Speech signal segment
"Little A" (first)         | Speech signal segment 21
"Little A" (second)        | Speech signal segment 22
"turn the air conditioner" | Speech signal segment 23
"up"                       | Speech signal segment 24
After the second deduplicated text is determined to be "Little A, please turn the air conditioner up", the second deduplicated speech signal can be determined from the speech signal segments that this text corresponds to in the 1st second voice signal and in the 2nd second voice signal. For example, speech signal segment 11 corresponding to "Little A" (the first), speech signal segment 13 corresponding to "please turn on the air conditioner", and speech signal segment 24 corresponding to "up" are spliced into the second deduplicated speech signal. Of course, the second deduplicated speech signal may also be formed by splicing speech signal segment 12, speech signal segment 13, and speech signal segment 24.
After the second deduplicated speech signal is obtained, it can be determined whether it meets the update condition (that is, the judgment of step 803 is performed again); if so, step 807 is performed.
If it is determined that the second deduplicated speech signal does not meet the update condition, the 3rd second voice signal is further acquired and ASR processing is performed on it. After the text corresponding to the 3rd second voice signal is obtained, a deduplication operation is performed on that text according to the historical deduplicated text (the second deduplicated text), yielding the deduplicated text jointly corresponding to the 1st through 3rd second voice signals (which may be called the third deduplicated text). According to the third deduplicated text, a deduplicated speech signal (which may be called the third deduplicated speech signal) is determined from the 1st through 3rd second voice signals, and it is determined whether the third deduplicated speech signal meets the update condition.
Repeating the above operations, step 806 may be performed: acquire the i-th second voice signal, and determine, according to the i-th second voice signal and its historical deduplicated text, the deduplicated speech signal jointly corresponding to the 1st through i-th second voice signals, where the historical deduplicated text is determined from the texts respectively corresponding to the 1st through (i-1)-th second voice signals. This loop continues until the obtained deduplicated speech signal meets the update condition.
To better explain this embodiment of the application, an explanation is given with reference to Table 4. Assume the update condition is that the duration of the deduplicated speech signal is greater than 5 s (that is, the duration threshold is 5 s) and that the number of characters in the non-repeated text corresponding to the deduplicated speech signal is greater than 12 (that is, the character-count threshold is 12). During the voiceprint update process:
(1) The 1st second voice signal "你好小A" ("Hello Little A") is acquired, and ASR yields the corresponding text "你好小A". Since this text contains no repeated text, the first deduplicated text is "你好小A", with 4 characters, and the duration of the first deduplicated speech signal is 1.5 s. The character count (4) is not greater than 12 and the duration (1.5 s) is not greater than 5 s, so the 2nd second voice signal is further acquired.
(2) The 2nd second voice signal "打开空调" ("turn on the air conditioner") is acquired, and ASR yields the corresponding text "打开空调". Deduplication and splicing against the historical deduplicated text "你好小A" yield the second deduplicated text "你好小A打开空调", with 8 characters; the corresponding second deduplicated speech signal lasts 3 s. The character count (8) is not greater than 12 and the duration (3 s) is not greater than 5 s, so the 3rd second voice signal is further acquired.
(3) The 3rd second voice signal "空调开高点" ("turn the air conditioner up") is acquired, and ASR yields the corresponding text "空调开高点". Deduplication and splicing against the historical deduplicated text "你好小A打开空调" yield the third deduplicated text "你好小A打开空调高点", with 10 characters; the corresponding third deduplicated speech signal lasts 3.5 s. The character count (10) is not greater than 12 and the duration (3.5 s) is not greater than 5 s, so the 4th second voice signal is further acquired.
(4) The 4th second voice signal "你好小A" ("Hello Little A") is acquired, and ASR yields the corresponding text "你好小A". Deduplication and splicing against the historical deduplicated text "你好小A打开空调高点" yield the fourth deduplicated text "你好小A打开空调高点", still 10 characters; the corresponding fourth deduplicated speech signal lasts 3.5 s. The character count (10) is not greater than 12 and the duration (3.5 s) is not greater than 5 s, so the 5th second voice signal is further acquired.
(5) The 5th second voice signal "打开天窗" ("open the sunroof") is acquired, and ASR yields the corresponding text "打开天窗". Deduplication and splicing against the historical deduplicated text "你好小A打开空调高点" yield the fifth deduplicated text "你好小A打开空调高点天窗", with 12 characters; the corresponding fifth deduplicated speech signal lasts 4 s. The character count (12) is not greater than 12 and the duration (4 s) is not greater than 5 s, so the 6th second voice signal is further acquired.
(6) The 6th second voice signal "播放音乐" ("play music") is acquired, and ASR yields the corresponding text "播放音乐". Deduplication and splicing against the historical deduplicated text "你好小A打开空调高点天窗" yield the sixth deduplicated text "你好小A打开空调高点天窗播放音乐", with 16 characters; the corresponding sixth deduplicated speech signal lasts 5.5 s. Now the character count (16) is greater than 12 and the duration (5.5 s) is greater than 5 s.
Thus it is determined that the sixth deduplicated speech signal meets the update condition, and step 807 is performed based on the sixth deduplicated speech signal.
Table 4

No. | Second voice signal                     | Deduplicated text                | Characters | Duration
1   | 你好小A (Hello Little A)                | 你好小A                          | 4          | 1.5 s
2   | 打开空调 (turn on the air conditioner)  | 你好小A打开空调                  | 8          | 3 s
3   | 空调开高点 (turn the air conditioner up)| 你好小A打开空调高点              | 10         | 3.5 s
4   | 你好小A (Hello Little A)                | 你好小A打开空调高点              | 10         | 3.5 s
5   | 打开天窗 (open the sunroof)             | 你好小A打开空调高点天窗          | 12         | 4 s
6   | 播放音乐 (play music)                   | 你好小A打开空调高点天窗播放音乐  | 16         | 5.5 s
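The accumulation loop of steps 804 to 806 together with the Table 4 example can be simulated in a short sketch. The greedy character-level merge below is an illustrative assumption (the patent does not fix the deduplication algorithm), and the per-round durations are taken directly from Table 4 rather than measured from audio.

```python
def merge_dedup(history: str, new_text: str) -> str:
    # Greedy character-level deduplicate-and-splice: drop each longest
    # run of new_text already present in history, append the rest.
    # One possible realisation, not the patent's specified algorithm.
    i, kept = 0, []
    while i < len(new_text):
        j = i
        while j < len(new_text) and new_text[i:j + 1] in history:
            j += 1
        if j > i:
            i = j
        else:
            kept.append(new_text[i])
            i += 1
    return history + "".join(kept)

CHAR_THRESHOLD = 12    # character-count threshold from the example
TIME_THRESHOLD = 5.0   # duration threshold in seconds

utterances = ["你好小A", "打开空调", "空调开高点", "你好小A", "打开天窗", "播放音乐"]
durations  = [1.5, 3.0, 3.5, 3.5, 4.0, 5.5]   # cumulative durations from Table 4

history = ""
for n, (text, secs) in enumerate(zip(utterances, durations), start=1):
    history = merge_dedup(history, text)
    if len(history) > CHAR_THRESHOLD and secs > TIME_THRESHOLD:
        break

print(n, history, len(history))   # 6 你好小A打开空调高点天窗播放音乐 16
```

The loop stops at the 6th utterance with a 16-character deduplicated text and a 5.5 s signal, matching rows (1) through (6) of Table 4.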
Step 807: determine the third voiceprint of the first user according to the deduplicated speech signal that meets the update condition.
For example, the voiceprint extraction procedure illustrated in Fig. 1c is executed on the deduplicated speech signal that meets the update condition, to obtain the third voiceprint of the first user.
Step 808: update the reference voiceprint of the first user according to the third voiceprint.
In one optional implementation, the reference voiceprint of the first user may be updated actively, that is, the third voiceprint replaces the original reference voiceprint. In another optional implementation, the first user may be prompted as to whether to update the reference voiceprint.
The prompting manner can be explained with the example in Fig. 2: the first user is user A (the vehicle owner), and when the in-vehicle device determines that user A's voiceprint has changed permanently, it may prompt user A on the display interface as to whether the reference voiceprint should be updated automatically.
For example, as shown in Fig. 9, the in-vehicle device displays the prompt "Your voiceprint has changed. Replace the original voiceprint with the detected one?" on the display interface. If user A taps "OK", the in-vehicle device replaces the original reference voiceprint with the obtained third voiceprint. If user A taps "No, I will update it myself", the in-vehicle device may further display a passage of preset text on the display interface and prompt user A to read it aloud; the device then obtains user A's current voice signal, extracts a voiceprint from that signal to obtain user A's reference voiceprint, and stores it. Of course, in some other embodiments, when it is determined from the verification results of the first voiceprints in the first time window that the reference voiceprint needs to be updated, user A may also be prompted on the display interface to update the reference voiceprint manually.
After the reference voiceprint is updated, a sliding operation may be performed on the second time window to obtain a slid second time window, where the starting point of the slid second time window is no earlier than the time point at which the reference voiceprint was updated; alternatively, the starting point of the slid second time window is no earlier than the time point at which the second voice signal was acquired.
For example, the starting point of the second time window may be after the end point of the third time window, or coincide with it. As shown in Fig. 5(e), the starting point of the slid second time window coincides with the end point of the third time window. Further, within the second time window after the reference voiceprint is updated (that is, the slid second time window), one or more second voiceprints may be obtained, and the average similarity of the slid second time window is determined from the similarities between each of those second voiceprints and the updated reference voiceprint.
Further, after the average similarity of the second time window is determined, the first time window is slid, with the end point of the slid first time window later than the starting point of the slid second time window. For example, the starting point of the slid first time window may be after the end point of the slid second time window, or coincide with it. As shown in Fig. 5(e), the starting point of the slid first time window coincides with the end point of the slid second time window; specifically, the slid first time window is t8-t9, where t8 and t9 are separated by the first preset duration.
One or more first voiceprints are obtained in the slid first time window, and their verification results are determined from the similarities between the first voiceprints and the updated reference voiceprint together with the average similarity of the slid second time window; it is then determined whether to update the reference voiceprint again.
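The window logic just described can be sketched as follows. All similarity values and the ratio threshold below are hypothetical placeholders; the patent leaves the concrete values open.

```python
def average_similarity(similarities):
    # Average similarity of the (slid) second time window.
    return sum(similarities) / len(similarities)

def pass_rate(first_window_sims, baseline_avg):
    # Fraction of first-window voiceprints whose similarity to the
    # updated reference exceeds the second-window average similarity.
    passed = sum(1 for s in first_window_sims if s > baseline_avg)
    return passed / len(first_window_sims)

second_window_sims = [0.92, 0.90, 0.91]        # hypothetical values
baseline = average_similarity(second_window_sims)

first_window_sims = [0.70, 0.93, 0.72, 0.68]   # hypothetical values
rate = pass_rate(first_window_sims, baseline)

RATIO_THRESHOLD = 0.5                          # hypothetical threshold
needs_update = rate <= RATIO_THRESHOLD
print(round(rate, 2), needs_update)            # 0.25 True
```

A low pass rate in the first window (here 0.25, below the threshold) indicates the voiceprint may have changed permanently, triggering the update procedure of steps 804 to 808.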
An explanation is given with reference to the schematic flowchart of yet another voiceprint management method illustrated in Fig. 10:
The voice signal of the first user is acquired and pre-processed to obtain the audio feature information of the voice signal. Audio feature extraction is performed on the audio feature information to determine the first user's voiceprint, and post-processing is performed to obtain the final voiceprint of the first user. In one case, this voiceprint of the first user is registered or updated as the reference voiceprint. In another case, it may be determined that the current time point lies in the first time window or in the second time window.
When the current time point lies in the first time window, the first user's voiceprint (that is, the first voiceprint) is compared with the reference voiceprint to obtain the similarity between the first voiceprint and the reference voiceprint; if this similarity is greater than the average similarity of the second time window, the first voiceprint is determined to have passed verification. The ratio of first voiceprints in the first time window that pass verification (that is, the pass rate in the first time window) is determined; if the pass rate is greater than the ratio threshold, the first time window is slid; if the pass rate is less than or equal to the ratio threshold, the procedure for updating the reference voiceprint is started.
When the current time point lies in the second time window, the first user's voiceprint (that is, the second voiceprint) is compared with the reference voiceprint to obtain the similarity between the second voiceprint and the reference voiceprint. The average of the similarities between the multiple second voiceprints in the second time window and the reference voiceprint is determined as the average similarity.
It should be added that multiple thresholds are involved in this application, such as the ratio threshold, similarity threshold, difference threshold, count threshold, duration threshold, and character-count threshold. In a threshold-based judgment, values greater than or equal to the threshold may correspond to the first result and values less than the threshold to the second result; alternatively, values greater than the threshold may correspond to the first result and values less than or equal to the threshold to the second result. This application does not limit this. For example, in determining whether to update the first user's reference voiceprint, although the embodiment states that the reference voiceprint need not be updated when the ratio is greater than the ratio threshold (the first result) and is updated when the ratio is less than or equal to the ratio threshold (the second result), it is equally possible in this application that the reference voiceprint need not be updated when the ratio is greater than or equal to the ratio threshold (the first result) and is updated when the ratio is less than the ratio threshold (the second result).
It should also be added that, in the process of the voiceprint management device obtaining the text corresponding to the second voice signal, a semantic understanding device may perform the ASR to obtain the text corresponding to the second voice signal, and the voiceprint management device then obtains that text from the semantic understanding device; this application does not limit this.
应理解,上述技术方案中,预先确定第一用户的基准声纹,然后在第二时间窗中获取第一用户的第二声纹,确定第二声纹与基准声纹的相似度,根据第二时间窗中的多个相似度,确定第二时间窗的平均相似度。由于第二时间窗在存储基准声纹的时间点之后,即用户在注册基准声纹的时间点至第二时间窗的终止点的时间段中,用户的声纹是处于平稳变化的,可以基于该处于平稳变化的第一用户的声纹,得到基准参数(即基准声纹和第二时间窗的平均相似度)。在后续判定用户声纹是否发生长久改变的过程中,可以基于第一时间窗中用户的第一声纹,确定第一声纹与基准声纹之间的相似度,将该相似度与第二时间窗的平均相似度相比较,确定第一声纹的验证结果,即确定第一时间窗中第一声纹是否与第二时间窗中第二声纹类似,进而确定第二时间窗的起始点至第一时间窗的终止点的时间段中,用户的声纹是否发生了长久改变。It should be understood that in the above technical solution, the reference voiceprint of the first user is determined in advance, and then the second voiceprint of the first user is obtained in the second time window, and the similarity between the second voiceprint and the reference voiceprint is determined. Multiple similarities in two time windows, determining an average similarity in the second time window. Since the second time window is after the time point when the reference voiceprint is stored, that is, during the time period from the time point when the user registers the reference voiceprint to the end point of the second time window, the user's voiceprint is in a steady change, which can be based on The voiceprint of the first user that is changing steadily is used to obtain a reference parameter (that is, the average similarity between the reference voiceprint and the second time window). 
In the subsequent process of determining whether the user's voiceprint has undergone permanent changes, the similarity between the first voiceprint and the reference voiceprint can be determined based on the user's first voiceprint in the first time window, and the similarity can be compared with the second Compare the average similarity of the time window to determine the verification result of the first voiceprint, that is, determine whether the first voiceprint in the first time window is similar to the second voiceprint in the second time window, and then determine the start of the second time window During the time period from the start point to the end point of the first time window, whether the user's voiceprint has undergone a long-term change.
而且,本申请中确定第一时间窗中多个第一声纹的验证结果,确定第一时间窗中通过验证的第一声纹的比率(即第一时间窗中达标率),根据第一时间窗中达标率确定用户的声纹是否发生了长久改变,避免因为第一时间窗中用户偶然原因导致误判断,有助于提高判定的准确性。Moreover, in this application, the verification results of multiple first voiceprints in the first time window are determined, and the ratio of the first voiceprints that pass verification in the first time window (that is, the compliance rate in the first time window) is determined. According to the first The compliance rate in the time window determines whether the user's voiceprint has changed for a long time, avoiding misjudgment caused by the user's accidental reasons in the first time window, and helping to improve the accuracy of the judgment.
Further, when the reference voiceprint is not updated, the first time window can be slid, the ratio of first voiceprints that pass verification in the slid first time window (that is, the pass rate in the slid first time window) can be determined, and whether the user's voiceprint has changed permanently can be determined according to that pass rate. This helps detect a permanent change in the first user's voiceprint in a timely manner and update the first user's registered voiceprint.
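Sliding the first time window can be sketched with a fixed-size buffer of recent verification results; the window size of five results and the 0.5 threshold are illustrative assumptions (the disclosure also allows a variable-length window).

```python
from collections import deque

class SlidingFirstTimeWindow:
    # Minimal sketch: keep the most recent verification results and
    # re-evaluate the pass rate each time the window slides forward.

    def __init__(self, size, ratio_threshold=0.5):
        self.results = deque(maxlen=size)
        self.ratio_threshold = ratio_threshold

    def slide(self, new_result):
        # The end point of the window moves later as each new result
        # arrives; the oldest result falls out once the window is full.
        self.results.append(new_result)

    def needs_update(self):
        # True when the pass rate in the current window is at or below
        # the threshold.
        if len(self.results) < self.results.maxlen:
            return False  # wait until the window is full
        return sum(self.results) / len(self.results) <= self.ratio_threshold

window = SlidingFirstTimeWindow(size=5)
for result in [True, True, True, True, True]:
    window.slide(result)
print(window.needs_update())  # False: every recent voiceprint passed

for result in [False, False, False, False]:
    window.slide(result)  # window now holds one True and four False results
print(window.needs_update())  # True: pass rate 1/5 is below the threshold
```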
Further, when the reference voiceprint is updated, the second voice signal is acquired, and the voiceprint of the first user is determined according to the acquired second voice signal, without interacting with the user. The update is therefore imperceptible to the user, which helps improve the user experience. Moreover, there is no need to pre-store the multiple first voiceprints in the first time window, which helps reduce the amount of storage. In the present application, there is also no need to store the multiple second voiceprints in the second time window; only the average similarity of the second time window needs to be saved, which further reduces the amount of storage.
The various embodiments described herein may be independent solutions, or may be combined according to their internal logic, and all of these solutions fall within the protection scope of the present application.
It can be understood that, in the foregoing method embodiments, the methods and operations implemented by the voiceprint management apparatus may also be implemented by a component (for example, a chip or a circuit) usable in the voiceprint management apparatus.
The division of modules in the embodiments of the present application is schematic and is merely a logical function division; other division manners are possible in actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated into one processor, may each exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
Based on the foregoing content and the same concept, FIG. 11 and FIG. 12 are schematic structural diagrams of possible voiceprint management apparatuses exemplarily provided by the present application. These voiceprint management apparatuses can be used to implement the functions of the foregoing method embodiments, and can therefore also achieve the beneficial effects of those method embodiments.
As shown in FIG. 11, the voiceprint management apparatus provided by the present application includes an acquisition module 1101 and a processing module 1102.
Exemplarily, the acquisition module 1101 may be configured to perform the acquisition functions of the voiceprint management apparatus in the method embodiments related to FIG. 6, FIG. 8, or FIG. 10, for example, the step of acquiring the first voice signal or the step of acquiring the second voice signal. Exemplarily, the processing module 1102 may be configured to perform the processing functions of the voiceprint management apparatus in those method embodiments, for example, the step of determining the verification result of the first voiceprint, the step of determining whether to update the reference voiceprint, or the step of updating the reference voiceprint.
In a possible implementation, the acquisition module 1101 is configured to acquire the verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and the average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate how similar voiceprints of the first user acquired in the second time window are to the reference voiceprint; and the start point of the first time window is later than the start point of the second time window. The processing module 1102 is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
In a possible implementation, the verification result of the first voiceprint is determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window as follows: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification is passed.
In a possible implementation, the processing module 1102 is specifically configured to update the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window.
In a possible implementation, the processing module 1102 is specifically configured to update the reference voiceprint when the ratio of first voiceprints that pass verification in the first time window is less than or equal to a ratio threshold.
In a possible implementation, the processing module 1102 is specifically configured to: control the acquisition module 1101 to acquire a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate how similar voiceprints of the first user acquired in the first time window are to the reference voiceprint; and update the reference voiceprint according to the second voice signal.
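The selection of a usable second voice signal can be sketched as a threshold test; the numeric similarity values and the 0.1 difference threshold below are assumptions for illustration, not values from the disclosure.

```python
def is_usable_update_signal(signal_similarity, first_window_avg, diff_threshold=0.1):
    # The second voice signal used for updating must resemble the user's
    # current (drifted) voice: its similarity to the reference voiceprint
    # must lie within diff_threshold of the first time window's average
    # similarity.
    return abs(signal_similarity - first_window_avg) < diff_threshold

first_window_avg = 0.6  # assumed average similarity of the first time window

print(is_usable_update_signal(0.55, first_window_avg))  # True: matches the drifted voice
print(is_usable_update_signal(0.95, first_window_avg))  # False: an outlier relative to the window
```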
In a possible implementation, the processing module 1102 is specifically configured to: obtain a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal is non-repetitive text; and update the reference voiceprint according to the deduplicated voice signal.
In a possible implementation, there is one second voice signal, and the processing module 1102 is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and obtain the deduplicated voice signal according to the voice signal corresponding to the deduplicated text within the second voice signal.
In a possible implementation, there are multiple second voice signals, and the processing module 1102 is specifically configured to: for the i-th second voice signal among the multiple second voice signals, perform a deduplication operation on the text corresponding to the i-th second voice signal according to historical deduplicated text, and splice the text corresponding to the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the 1st to i-th second voice signals, where the historical deduplicated text is obtained from the texts respectively corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtain the deduplicated voice signal according to the voice signals, among the multiple second voice signals, that correspond to the deduplicated text jointly corresponding to the multiple second voice signals.
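The incremental deduplication over multiple second voice signals can be sketched at character level (a natural unit for Chinese text); the transcripts below are invented examples, and the step of mapping the retained characters back to their audio segments is elided.

```python
def deduplicate_incrementally(transcripts):
    # For the i-th transcript, drop characters already present in the
    # historical deduplicated text, then splice the remainder onto that
    # history, yielding the deduplicated text jointly corresponding to
    # transcripts 1..i.
    history = ""   # historical deduplicated text
    seen = set()   # characters already covered by the history
    for text in transcripts:
        for ch in text:
            if ch not in seen:
                seen.add(ch)
                history += ch
    return history

transcripts = ["open the map", "play the song", "navigate home"]
dedup = deduplicate_incrementally(transcripts)
print(dedup)       # each character appears exactly once
print(len(dedup))  # equals the number of distinct characters overall
```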
In a possible implementation, the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.
In a possible implementation, the processing module 1102 is further configured to slide the first time window, where the end point of the slid first time window is later than the end point of the first time window before sliding.
In a possible implementation, the length of the first time window is variable.
In a possible implementation, the processing module 1102 is further configured to slide the second time window, where the end point of the slid second time window is later than the end point of the second time window before sliding, and the start point of the slid second time window is earlier than the start point of the slid first time window.
In a possible implementation, before the acquisition module 1101 acquires the verification result of the first voiceprint in the first time window, the processing module 1102 is further configured to: determine the similarities in the second time window according to the reference voiceprint and the second voiceprints of the first user acquired in the second time window; and determine the average similarity of the second time window according to the multiple similarities in the second time window.
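The computation of the second time window's average similarity can be sketched as the mean of per-voiceprint similarities against the reference; cosine similarity and the toy embeddings below are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    # Similarity between two voiceprint embeddings (assumed measure).
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def second_window_average(reference_vp, second_voiceprints):
    # Mean of the similarities between each second voiceprint acquired
    # in the second time window and the reference voiceprint.
    sims = [cosine_similarity(vp, reference_vp) for vp in second_voiceprints]
    return sum(sims) / len(sims)

reference = [1.0, 0.0]
window_voiceprints = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]]
avg = second_window_average(reference, window_voiceprints)
print(round(avg, 3))  # close to 1.0, since the window tracks the reference
```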
FIG. 12 shows an apparatus provided by an embodiment of the present application. The apparatus shown in FIG. 12 may be a hardware-circuit implementation of the apparatus shown in FIG. 11. This apparatus can be applied in the flowcharts shown above to perform the foregoing method embodiments.
For ease of illustration, FIG. 12 shows only the main components of the apparatus.
The voiceprint management apparatus includes a processor 1210 and an interface 1230; optionally, the voiceprint management apparatus further includes a memory 1220. The interface 1230 is used for communicating with other devices.
The methods performed by the voiceprint management apparatus in the foregoing embodiments may be implemented by the processor 1210 invoking a program stored in a memory (which may be the memory 1220 in the voiceprint management apparatus or an external memory). That is, the voiceprint management apparatus may include the processor 1210, and the processor 1210 invokes a program in the memory to perform the methods performed by the voiceprint management apparatus in the foregoing method embodiments. The processor here may be an integrated circuit with signal processing capability, for example, a CPU. The voiceprint management apparatus may be implemented by one or more integrated circuits configured to implement the foregoing methods, for example, one or more ASICs, one or more microprocessor DSPs, one or more FPGAs, or a combination of at least two of these integrated circuit forms. Alternatively, the foregoing implementations may be combined.
Specifically, the functions/implementation processes of the processing module 1102 and the acquisition module 1101 in FIG. 11 may be implemented by the processor 1210 in the voiceprint management apparatus shown in FIG. 12 invoking computer-executable instructions stored in the memory 1220.
Based on the foregoing content and the same concept, the present application provides a computer program product. The computer program product includes a computer program or instructions, and when the computer program or instructions are executed by a computing device, the methods in the foregoing method embodiments are implemented.
Based on the foregoing content and the same concept, the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program or instructions, and when the computer program or instructions are executed by a computing device, the methods in the foregoing method embodiments are implemented.
Based on the foregoing content and the same concept, the present application provides a computing device, including a processor. The processor is connected to a memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device performs the methods in the foregoing method embodiments.
Based on the foregoing content and the same concept, an embodiment of the present application provides a chip system, including a processor and a memory. The processor is coupled to the memory, and the memory is configured to store a program or instructions. When the program or instructions are executed by the processor, the chip system implements the methods in the foregoing method embodiments.
Optionally, the chip system further includes an interface circuit, and the interface circuit is used to exchange code instructions with the processor.
Optionally, there may be one or more processors in the chip system, and the processor may be implemented by hardware or by software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented by software, the processor may be a general-purpose processor that operates by reading software code stored in a memory.
Optionally, there may also be one or more memories in the chip system. The memory may be integrated with the processor or disposed separately from the processor. Exemplarily, the memory may be a non-transitory memory, for example, a read-only memory (ROM); it may be integrated with the processor on the same chip or disposed on a different chip.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Apparently, those skilled in the art can make various changes and modifications to the present application without departing from the scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include these modifications and variations.

Claims (29)

  1. A voiceprint management method, comprising:
    acquiring a verification result of a first voiceprint in a first time window, wherein the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate how similar voiceprints of the first user acquired in the second time window are to the reference voiceprint; and a start point of the first time window is later than a start point of the second time window; and
    updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  2. The method according to claim 1, wherein the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window comprises:
    when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification is passed.
  3. The method according to claim 1 or 2, wherein the updating the reference voiceprint according to the verification result of the first voiceprint in the first time window comprises:
    updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window.
  4. The method according to claim 3, wherein the updating the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window comprises:
    updating the reference voiceprint when the ratio of first voiceprints that pass verification in the first time window is less than or equal to a ratio threshold.
  5. The method according to any one of claims 1 to 4, wherein the updating the reference voiceprint comprises:
    acquiring a second voice signal of the first user, wherein the difference between the similarity of the second voice signal to the reference voiceprint and an average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate how similar voiceprints of the first user acquired in the first time window are to the reference voiceprint; and
    updating the reference voiceprint according to the second voice signal.
  6. The method according to claim 5, wherein the updating the reference voiceprint according to the second voice signal comprises:
    obtaining a deduplicated voice signal according to the second voice signal and text corresponding to the second voice signal, wherein text corresponding to the deduplicated voice signal is non-repetitive text; and
    updating the reference voiceprint according to the deduplicated voice signal.
  7. The method according to claim 6, wherein there is one second voice signal, and the obtaining a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal comprises:
    performing a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and
    obtaining the deduplicated voice signal according to the voice signal corresponding to the deduplicated text within the second voice signal.
  8. The method according to claim 6, wherein there are multiple second voice signals, and the obtaining a deduplicated voice signal according to the second voice signals and the texts corresponding to the second voice signals comprises:
    for an i-th second voice signal among the multiple second voice signals:
    performing a deduplication operation on the text corresponding to the i-th second voice signal according to historical deduplicated text, and splicing the text corresponding to the i-th second voice signal after the deduplication operation with the historical deduplicated text to obtain deduplicated text jointly corresponding to the 1st to i-th second voice signals, wherein the historical deduplicated text is obtained from the texts respectively corresponding to the 1st to (i-1)-th second voice signals, and i is greater than 1; and
    obtaining the deduplicated voice signal according to the voice signals, among the multiple second voice signals, that correspond to the deduplicated text jointly corresponding to the multiple second voice signals.
  9. The method according to any one of claims 6 to 8, wherein the duration of the deduplicated voice signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated voice signal is greater than a character-count threshold.
  10. The method according to any one of claims 1 to 9, further comprising:
    sliding the first time window, wherein the end point of the slid first time window is later than the end point of the first time window before sliding.
  11. The method according to any one of claims 1 to 10, wherein the length of the first time window is variable.
  12. The method according to any one of claims 1 to 11, further comprising:
    sliding the second time window, wherein the end point of the slid second time window is later than the end point of the second time window before sliding, and the start point of the slid second time window is earlier than the start point of the slid first time window.
  13. The method according to any one of claims 1 to 12, further comprising, before the acquiring a verification result of a first voiceprint in a first time window:
    determining the similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and
    determining the average similarity of the second time window according to the multiple similarities in the second time window.
  14. A voiceprint management apparatus, comprising:
    an acquisition module and a processing module;
    wherein the acquisition module is configured to acquire a verification result of a first voiceprint in a first time window, wherein the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window; the average similarity of the second time window is used to indicate how similar voiceprints of the first user acquired in the second time window are to the reference voiceprint; and a start point of the first time window is later than a start point of the second time window; and
    the processing module is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  15. The apparatus according to claim 14, wherein the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window comprises:
    when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification is passed.
  16. The apparatus according to claim 14 or 15, wherein the processing module is specifically configured to:
    update the reference voiceprint according to the ratio of first voiceprints that pass verification in the first time window.
  17. The apparatus according to claim 16, wherein the processing module is specifically configured to:
    update the reference voiceprint when the ratio of first voiceprints that pass verification in the first time window is less than or equal to a ratio threshold.
  18. The apparatus according to any one of claims 14 to 17, wherein the processing module is specifically configured to:
    control the acquisition module to acquire a second voice signal of the first user, wherein the difference between the similarity of the second voice signal to the reference voiceprint and an average similarity of the first time window is less than a difference threshold, and the average similarity of the first time window is used to indicate how similar voiceprints of the first user acquired in the first time window are to the reference voiceprint; and
    update the reference voiceprint according to the second voice signal.
  19. The apparatus according to claim 18, wherein the processing module is specifically configured to:
    obtain a deduplicated voice signal according to the second voice signal and text corresponding to the second voice signal, wherein text corresponding to the deduplicated voice signal is non-repetitive text; and
    update the reference voiceprint according to the deduplicated voice signal.
  20. The apparatus according to claim 18, wherein there is one second voice signal, and the processing module is specifically configured to:
    perform a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and
    obtain the deduplicated voice signal according to the voice signal corresponding to the deduplicated text within the second voice signal.
  21. The apparatus according to claim 18, wherein there are multiple second speech signals, and the processing module is specifically configured to:
    for the i-th second speech signal among the multiple second speech signals:
    perform a deduplication operation on the text corresponding to the i-th second speech signal according to historical deduplicated text; splice the text corresponding to the i-th second speech signal after the deduplication operation with the historical deduplicated text, to obtain the deduplicated text jointly corresponding to the first second speech signal through the i-th second speech signal, wherein the historical deduplicated text is obtained according to the texts respectively corresponding to the first second speech signal through the (i-1)-th second speech signal, and i is greater than 1; and
    obtain the deduplicated speech signal according to the speech signals corresponding, in the multiple second speech signals, to the deduplicated text jointly corresponding to the multiple second speech signals.
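The incremental deduplication described in claims 19 to 21 can be sketched as follows. This is an illustrative simplification, not the claimed implementation: the function and variable names are hypothetical, and text is deduplicated here at character granularity, with `seen` playing the role of the historical deduplicated text.

```python
def incremental_dedup(texts):
    """Deduplicate the texts of multiple second speech signals in order.

    Units already present in the historical deduplicated text (`seen`)
    are dropped from the i-th text; the remainder is spliced onto the
    running result, yielding the non-repetitive text jointly
    corresponding to signals 1..i.
    """
    seen = set()      # historical deduplicated text, as a set of units
    dedup_text = []   # deduplicated text common to signals 1..i
    for text in texts:
        for unit in text:
            if unit not in seen:   # deduplicate against history
                seen.add(unit)
                dedup_text.append(unit)   # splice onto history
    return "".join(dedup_text)
```

The deduplicated speech signal would then be assembled from the audio segments aligned with the surviving text units.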
  22. The apparatus according to any one of claims 18 to 21, wherein the duration of the deduplicated speech signal is greater than a duration threshold, and/or the number of characters in the non-repetitive text corresponding to the deduplicated speech signal is greater than a character-count threshold.
  23. The apparatus according to any one of claims 14 to 22, wherein the processing module is further configured to:
    slide the first time window, wherein the end point of the first time window after sliding is later than the end point of the first time window before sliding.
  24. The apparatus according to any one of claims 14 to 23, wherein the length of the first time window is variable.
  25. The apparatus according to any one of claims 14 to 24, wherein the processing module is further configured to:
    slide the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
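The window-sliding behaviour of claims 23 and 25 can be illustrated with a small sketch. Representing a time window as a `(start, end)` pair and sliding both windows by a fixed step are assumptions made for illustration only; the claims do not fix a representation or a step size.

```python
def slide_window(window, step):
    """Slide a (start, end) time window forward by `step` seconds."""
    start, end = window
    return (start + step, end + step)

# Hypothetical first and second time windows; the second starts
# earlier than the first (as in claim 25) and may be longer.
first = (10.0, 20.0)
second = (0.0, 20.0)

first_slid = slide_window(first, 5.0)
second_slid = slide_window(second, 5.0)

# End points move later after sliding (claims 23 and 25), and the
# slid second window still starts before the slid first window.
assert first_slid[1] > first[1]
assert second_slid[1] > second[1]
assert second_slid[0] < first_slid[0]
```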
  26. The apparatus according to any one of claims 14 to 25, wherein before the acquiring module acquires the verification result of the first voiceprint in the first time window, the processing module is further configured to:
    determine the similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and
    determine the average similarity of the second time window according to the multiple similarities in the second time window.
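Claim 26's per-window average similarity could be computed, for example, as the mean similarity between the reference voiceprint and each second voiceprint acquired in the second time window. Cosine similarity over fixed-length voiceprint embeddings is an assumption for this sketch; the claims do not prescribe a particular similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def window_average_similarity(reference, window_voiceprints):
    """Average the similarity of the reference voiceprint to every
    second voiceprint acquired in the second time window."""
    sims = [cosine_similarity(reference, vp) for vp in window_voiceprints]
    return sum(sims) / len(sims)
```

The resulting average could then serve as the baseline against which the first time window's verification result is judged.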
  27. A computer program product, comprising a computer program or instructions which, when executed by a computing device, implement the method according to any one of claims 1 to 13.
  28. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program or instructions which, when executed by a computing device, implement the method according to any one of claims 1 to 13.
  29. A computing device, comprising a processor connected to a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the computing device performs the method according to any one of claims 1 to 13.
PCT/CN2021/093917 2021-05-14 2021-05-14 Voiceprint management method and apparatus WO2022236827A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180041437.8A CN115699168A (en) 2021-05-14 2021-05-14 Voiceprint management method and device
PCT/CN2021/093917 WO2022236827A1 (en) 2021-05-14 2021-05-14 Voiceprint management method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/093917 WO2022236827A1 (en) 2021-05-14 2021-05-14 Voiceprint management method and apparatus

Publications (1)

Publication Number Publication Date
WO2022236827A1 2022-11-17

Family

ID=84027940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093917 WO2022236827A1 (en) 2021-05-14 2021-05-14 Voiceprint management method and apparatus

Country Status (2)

Country Link
CN (1) CN115699168A (en)
WO (1) WO2022236827A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597839A (en) * 2023-07-17 2023-08-15 山东唐和智能科技有限公司 Intelligent voice interaction system and method
CN116597839B (en) * 2023-07-17 2023-09-19 山东唐和智能科技有限公司 Intelligent voice interaction system and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070219801A1 (en) * 2006-03-14 2007-09-20 Prabha Sundaram System, method and computer program product for updating a biometric model based on changes in a biometric feature of a user
CN108231082A (en) * 2017-12-29 2018-06-29 广州势必可赢网络科技有限公司 Updating method and device for self-learning voiceprint recognition
CN110400567A (en) * 2019-07-30 2019-11-01 深圳秋田微电子股份有限公司 Register vocal print dynamic updating method and computer storage medium
CN110896352A (en) * 2018-09-12 2020-03-20 阿里巴巴集团控股有限公司 Identity recognition method, device and system
CN111063360A (en) * 2020-01-21 2020-04-24 北京爱数智慧科技有限公司 Voiceprint library generation method and device
CN112435673A (en) * 2020-12-15 2021-03-02 北京声智科技有限公司 Model training method and electronic terminal


Also Published As

Publication number Publication date
CN115699168A (en) 2023-02-03

Similar Documents

Publication Publication Date Title
WO2017197953A1 (en) Voiceprint-based identity recognition method and device
WO2018149209A1 (en) Voice recognition method, electronic device, and computer storage medium
US10204619B2 (en) Speech recognition using associative mapping
CN110265037B (en) Identity verification method and device, electronic equipment and computer readable storage medium
CN117577099A (en) Method, system and medium for multi-user authentication on a device
JP6469252B2 (en) Account addition method, terminal, server, and computer storage medium
WO2020155584A1 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2017162053A1 (en) Identity authentication method and device
WO2016014321A1 (en) Real-time emotion recognition from audio signals
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
CN111243603B (en) Voiceprint recognition method, system, mobile terminal and storage medium
US10923101B2 (en) Pausing synthesized speech output from a voice-controlled device
US11443730B2 (en) Initiating synthesized speech output from a voice-controlled device
KR101888058B1 (en) The method and apparatus for identifying speaker based on spoken word
WO2021213490A1 (en) Identity verification method and apparatus and electronic device
TW202018696A (en) Voice recognition method and device and computing device
CN116508097A (en) Speaker recognition accuracy
CN110634492A (en) Login verification method and device, electronic equipment and computer readable storage medium
WO2022236827A1 (en) Voiceprint management method and apparatus
CN109065026B (en) Recording control method and device
US10657951B2 (en) Controlling synthesized speech output from a voice-controlled device
US11763806B1 (en) Speaker recognition adaptation
CN110970027B (en) Voice recognition method, device, computer storage medium and system
CN108564374A (en) Payment authentication method, device, equipment and storage medium
US20230113883A1 (en) Digital Signal Processor-Based Continued Conversation

Legal Events

Date Code Title Description
121 (EP): the EPO has been informed by WIPO that EP was designated in this application; Ref document number: 21941396; Country of ref document: EP; Kind code of ref document: A1
NENP: non-entry into the national phase; Ref country code: DE
122 (EP): PCT application non-entry in European phase; Ref document number: 21941396; Country of ref document: EP; Kind code of ref document: A1