CN115699168A - Voiceprint management method and device - Google Patents

Voiceprint management method and device

Info

Publication number
CN115699168A
Authority
CN
China
Prior art keywords
voiceprint
time window
voice signal
text
user
Prior art date
Legal status
Pending
Application number
CN202180041437.8A
Other languages
Chinese (zh)
Inventor
张嘉祺
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN115699168A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification

Abstract

A voiceprint management method and a voiceprint management apparatus are provided for updating a reference voiceprint in a voiceprint management device in a timely and accurate manner, so as to improve recognition accuracy in voiceprint recognition. The method includes the following steps: obtaining a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window, the first voiceprint is determined according to a first voice signal of a first user obtained in the first time window, the average similarity of the second time window indicates how similar the voiceprints of the first user obtained in the second time window are to the reference voiceprint, and the starting point of the first time window is later than the starting point of the second time window; and updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.

Description

Voiceprint management method and device Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a voiceprint management method and apparatus.
Background
With the iteration of technology and the improvement of legal standards, voiceprint recognition is widely used in many scenarios, such as Internet of Vehicles scenarios, smart home scenarios, and business handling scenarios. Voiceprint recognition compares a voiceprint stored in a voiceprint management device with a collected voiceprint to determine the identity of a user.
The voiceprint management device stores a reference voiceprint of the user in advance (that is, the voiceprint the user registered with the voiceprint management device). During voiceprint recognition, when the voiceprint management device acquires a collected voiceprint of the user, it can compare the collected voiceprint with the reference voiceprint and determine whether the two correspond to the same user.
However, a user's voiceprint may change over time. To ensure the accuracy of voiceprint recognition, the reference voiceprint in the voiceprint management device needs to be updated in a timely manner.
Disclosure of Invention
The application provides a voiceprint management method and a voiceprint management device, which are used for timely and accurately updating a reference voiceprint in voiceprint management equipment so as to improve the accuracy of identification in voiceprint identification.
In a first aspect, the present application provides a voiceprint management method, which may be executed by a terminal device, such as a vehicle-mounted device (e.g., an in-vehicle unit, a vehicle-mounted computer, etc.) or a user terminal (e.g., a mobile phone, a computer, etc.). The voiceprint management method can also be implemented by a component of the terminal device, such as a processing apparatus, a circuit, or a chip in the terminal device, for example, a system chip or a processing chip in the terminal device. The system chip is also called a system on chip (SoC).
The method can also be executed by a server, where the server may be a physical device such as a host or a processor, a virtual device such as a virtual machine or a container, or a chip or an integrated circuit. The server is, for example, an Internet of Vehicles server, also referred to as a cloud server, cloud-side server, or cloud-side controller; it may be a single server or a server cluster formed by a plurality of servers, and is not specifically limited. Furthermore, the method may also be performed by a component of the server, such as a processing apparatus, circuit, or chip in the server.
In one possible implementation manner, the voiceprint management method provided by the present application includes: obtaining a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window, the first voiceprint is determined according to a first voice signal of a first user obtained in the first time window, the average similarity of the second time window indicates how similar the voiceprints of the first user obtained in the second time window are to the reference voiceprint, and the starting point of the first time window is later than the starting point of the second time window; and updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.
It should be understood that, in the above technical solution, the average similarity of the second time window is determined in advance, and this average similarity and the reference voiceprint are used together as reference parameters for deciding whether to update the reference voiceprint, thereby avoiding inaccurate decisions caused by an inaccurate single registration voiceprint when the first user registered. Furthermore, the similarity between each first voiceprint in the first time window and the reference voiceprint is determined, and combined with the average similarity of the second time window, it is determined whether the voiceprint of the first user has undergone a long-term change. If the voiceprint of the first user has changed long term, the reference voiceprint of the first user is updated; if it has not (e.g., only a short-term change has occurred), the reference voiceprint is not updated. This improves the accuracy of the first user's reference voiceprint, and thus the accuracy and robustness of voiceprint recognition.
In one possible implementation, determining the verification result of the first voiceprint according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window includes: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification passes; and/or, when the similarity between the first voiceprint and the reference voiceprint is less than or equal to the average similarity of the second time window, the verification result of the first voiceprint is that verification fails.
It should be understood that, in the above technical solution, the reference voiceprint and the average similarity of the second time window are used as reference parameters for deciding whether the first voiceprint passes verification, which avoids the problem of an inaccurate registered voiceprint caused by accidental factors when the first user registered, and thus helps improve the accuracy of the verification result of the first voiceprint. Furthermore, only the verification result (passed or failed) of each first voiceprint in the first time window needs to be stored; the first voiceprints themselves do not, which helps reduce storage.
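The verification rule above can be sketched as follows. This is a minimal illustration, assuming voiceprints are embedding vectors compared by cosine similarity; the patent does not fix the embedding representation or the similarity measure, so both are assumptions here.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two voiceprint embedding vectors
    # (an assumed similarity measure; the patent leaves it open).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify_first_voiceprint(first_vp, reference_vp, second_window_avg):
    """Return True (verification passes) when the similarity between the
    first voiceprint and the reference voiceprint exceeds the average
    similarity of the second time window. Only this boolean result needs
    to be stored per first voiceprint, not the voiceprint itself."""
    return cosine_similarity(first_vp, reference_vp) > second_window_avg
```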
In one possible implementation, updating the reference voiceprint based on the verification results of the first voiceprints in the first time window includes: updating the reference voiceprint according to the ratio of verified first voiceprints in the first time window (namely, the pass rate in the first time window), where the ratio of verified first voiceprints is the number of verified first voiceprints in the first time window divided by the number of first voiceprints in the first time window. In one possible implementation, updating the reference voiceprint based on the ratio of verified first voiceprints in the first time window includes: updating the reference voiceprint when the ratio of verified first voiceprints in the first time window is less than or equal to a ratio threshold.
It should be understood that, in the above technical solution, the verification results of a plurality of first voiceprints may be obtained in the first time window, and the ratio of verified first voiceprints is then determined from these verification results. This ratio can indicate whether the voiceprint of the first user has undergone a long-term change, and the reference voiceprint of the first user is then updated or not updated accordingly. This avoids erroneously updating the first user's reference voiceprint because of a single, possibly accidental, verification result. Whether the first user's voiceprint has changed long term is thereby identified accurately, so that the reference voiceprint is updated when the voiceprint has changed long term, and the original reference voiceprint is kept when it has not.
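The pass-rate-driven update decision can be sketched as below; the 0.5 ratio threshold is a hypothetical value, since the patent leaves the threshold unspecified.

```python
def should_update_reference(verification_results, ratio_threshold=0.5):
    """Decide whether to update the reference voiceprint from the
    verification results collected in the first time window.

    verification_results: list of booleans, True = verification passed.
    The reference voiceprint is updated when the pass rate is at or
    below the ratio threshold, i.e. the user's voiceprint appears to
    have undergone a long-term change."""
    if not verification_results:
        return False  # no evidence in the window; keep the reference
    pass_rate = sum(verification_results) / len(verification_results)
    return pass_rate <= ratio_threshold
```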
In one possible implementation, updating the reference voiceprint includes: acquiring a second voice signal of the first user, where the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is smaller than a difference threshold, and the average similarity of the first time window indicates how similar the voiceprints of the first user acquired in the first time window are to the reference voiceprint; and updating the reference voiceprint based on the second voice signal.
It should be understood that, in the above technical solution, when the similarity between the acquired second voice signal and the reference voiceprint meets the preset condition, it can be understood that, in the period from the starting point of the first time window to the acquisition of the second voice signal, the change in the first user's voiceprint is smaller than a change threshold (or, equivalently, the first user's voiceprint is in a steady state and is no longer changing long term), so the reference voiceprint can be updated according to the second voice signal.
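The steady-state check on a candidate second voice signal reduces to a comparison of scalar similarities, sketched below; the 0.05 difference threshold is assumed purely for illustration, as the patent does not specify a value.

```python
def is_stable_sample(sample_similarity, first_window_avg, diff_threshold=0.05):
    """A candidate second voice signal is accepted for re-enrollment only
    when the similarity of its voiceprint to the reference differs from
    the first-window average similarity by less than the difference
    threshold, i.e. the user's voiceprint has settled into a steady state."""
    return abs(sample_similarity - first_window_avg) < diff_threshold
```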
In one possible implementation, updating the reference voiceprint based on the second voice signal includes: obtaining a deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal, where the text corresponding to the deduplicated voice signal contains no repeated text; and updating the reference voiceprint according to the deduplicated voice signal.
In a possible implementation manner, obtaining the deduplicated voice signal according to the second voice signal and the text corresponding to the second voice signal includes: performing a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and obtaining the deduplicated voice signal from the portion of the second voice signal corresponding to the deduplicated text.
In a possible implementation manner, obtaining the deduplicated voice signal according to the second voice signals and the texts corresponding to the second voice signals includes: for the i-th second voice signal among a plurality of second voice signals: performing a deduplication operation on the text corresponding to the i-th second voice signal according to the historical deduplicated text; and splicing the deduplicated text of the i-th second voice signal with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the 1st to i-th second voice signals, where the historical deduplicated text is obtained from the texts corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtaining the deduplicated voice signal from the portions of the second voice signals corresponding to the jointly deduplicated text.
It should be understood that, in the above technical solution, performing the deduplication operation according to the second voice signal and its corresponding text yields the deduplicated voice signal, which prevents repeated occurrences of high-frequency words in the user's second voice signal from affecting the accuracy of the extracted reference voiceprint.
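The incremental deduplication across a plurality of second voice signals can be sketched as follows. This is a simplified model that represents each signal as word/audio-segment pairs; how words are aligned to audio (e.g., via ASR timestamps) is outside the patent's text and is assumed here.

```python
def deduplicate_signals(signals):
    """signals: list of lists of (word, audio_segment) pairs, one inner
    list per second voice signal, in acquisition order.

    Mirrors the splice with the historical deduplicated text: for each
    signal, words already present in the history are dropped, and the
    remaining words (with their audio segments) are appended. Returns
    the jointly deduplicated text and the corresponding audio."""
    history = []       # historical deduplicated text (word list)
    seen = set()
    dedup_audio = []   # speech segments corresponding to the deduplicated text
    for signal in signals:
        for word, segment in signal:
            if word not in seen:
                seen.add(word)
                history.append(word)
                dedup_audio.append(segment)
    return history, dedup_audio
```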
In one possible implementation, the duration of the deduplicated speech signal is greater than a duration threshold, and/or the number of words in the non-repeated text corresponding to the deduplicated speech signal is greater than a word count threshold.
It should be understood that, in the above technical solution, when the duration of the deduplicated voice signal is greater than the duration threshold, and/or the number of words in the non-repeated text corresponding to the deduplicated voice signal is greater than the word count threshold, the reference voiceprint is updated according to the deduplicated voice signal. This ensures that the deduplicated voice signal has sufficient duration and/or corresponds to a sufficient number of words, which further improves the accuracy of the extracted reference voiceprint.
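The sufficiency check before updating can be sketched as below. The threshold values are hypothetical; the patent only requires a duration threshold and/or a word count threshold, without fixing values, and here the stricter reading (both criteria) is implemented.

```python
def sufficient_for_update(duration_s, word_count,
                          duration_threshold=10.0, word_count_threshold=20):
    """Check whether the deduplicated voice signal is long enough and
    textually rich enough to re-extract the reference voiceprint.
    Requires both criteria (the patent's "and/or" also permits either)."""
    return duration_s > duration_threshold and word_count > word_count_threshold
```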
In one possible implementation manner, the method further includes: and sliding the first time window, wherein the end point of the first time window after sliding is later than the end point of the first time window before sliding.
It should be understood that, in the above technical solution, the first time window may be slid; the verification results of the first voiceprints in the slid first time window are then obtained, and the ratio of verified first voiceprints in the slid window (i.e., the pass rate in the slid first time window) is determined from those results. The end point of the first time window after sliding is later than the end point before sliding, and the interval between them is less than the duration of the first time window. Equivalently, the pass rate in the first time window is computed dynamically: the pass rate over the time span (i.e., the first time window) is determined at regular intervals, so that a long-term change in the first user's voiceprint can be found in time and the registered voiceprint of the first user updated.
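The sliding pass-rate computation can be sketched with a small bookkeeping class; timestamps and the window length are abstract units here, since the patent does not prescribe a concrete time base.

```python
from collections import deque

class SlidingPassRate:
    """Maintains verification results over a sliding first time window.
    Events are (timestamp, verified) pairs; querying the pass rate at a
    new window end point drops results older than the window length,
    which slides the window forward."""

    def __init__(self, window_length):
        self.window_length = window_length
        self.events = deque()

    def add(self, timestamp, verified):
        self.events.append((timestamp, verified))

    def pass_rate(self, window_end):
        # Keep only events inside [window_end - window_length, window_end].
        start = window_end - self.window_length
        while self.events and self.events[0][0] < start:
            self.events.popleft()
        if not self.events:
            return None
        return sum(v for _, v in self.events) / len(self.events)
```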
In one possible implementation, the length of the first time window is variable.
It should be understood that, in the above technical solution, a first count threshold may be set; when the number of verification results of first voiceprints determined in the first time window is less than the first count threshold, the duration of the first time window may be automatically extended until the number of determined verification results reaches the first count threshold. Having the number of verification results reach the first count threshold improves the accuracy of identifying a long-term change in the first user's voiceprint.
In one possible implementation manner, the method further includes: and sliding the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
In one possible implementation, the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding, including: and sliding the second time window when the reference voiceprint is updated, wherein the starting point of the second time window after sliding is not earlier than the time point of updating the reference voiceprint.
In one possible implementation, the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding, including: and sliding the second time window when the reference voiceprint is updated, wherein the starting point of the second time window after sliding is not earlier than the time point of the second voice signal acquisition.
It should be understood that, in the above technical solution, when the reference voiceprint is updated, the second time window may be slid; the second voiceprints in the slid second time window are then obtained, their similarities to the reference voiceprint determined, and the average similarity of the slid second time window obtained. After the reference voiceprint is updated, the average similarity of the second time window is updated in time, so that it can subsequently be judged whether the first user's voiceprint undergoes another long-term change.
In one possible implementation, the length of the second time window is variable.
It should be understood that, in the foregoing technical solution, a second count threshold may be set; when the number of similarities between the second voiceprints and the reference voiceprint determined in the second time window is smaller than the second count threshold, the duration of the second time window may be automatically extended until that number reaches the second count threshold. Having the number of similarities reach the second count threshold improves the accuracy of the average similarity of the second time window.
In a possible implementation manner, before obtaining the verification result of the first voiceprint in the first time window, the method further includes: determining the similarity in the second time window according to the reference voiceprint and the second voiceprint of the first user acquired in the second time window; and determining the average similarity of the second time window according to the plurality of similarities in the second time window.
It should be understood that, in the above technical solution, only the average similarity of the second time window needs to be stored; the plurality of second voiceprints in the second time window do not, which helps reduce storage. The average similarity of the second time window is used for determining the verification result of the first voiceprint in the first time window, and is easy to implement.
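The computation of the second-window average similarity, together with the variable window length described above, can be sketched as follows; the count threshold of 5 is a hypothetical value, since the patent only requires that the number of similarities reach a second count threshold.

```python
def second_window_average(similarities, count_threshold=5):
    """Mean of the similarities between each second voiceprint collected
    in the second time window and the reference voiceprint. Only this
    scalar is stored, not the voiceprints themselves.

    Returns None when fewer than count_threshold similarities have been
    collected, signalling the caller to extend the window first."""
    if len(similarities) < count_threshold:
        return None
    return sum(similarities) / len(similarities)
```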
In a second aspect, the present application provides a voiceprint management apparatus, which may be a terminal device or a component (such as a processing apparatus, a circuit, or a chip) in the terminal device. The terminal device is, for example, a vehicle-mounted device (e.g., an in-vehicle unit, a vehicle-mounted computer, etc.) or a user terminal (e.g., a mobile phone, a computer, etc.). The apparatus may also be a server, or a component (e.g., a processing apparatus, circuit, or chip) in a server. The server may be a physical device such as a host or a processor, a virtual device such as a virtual machine or a container, or a chip or an integrated circuit. The server is, for example, an Internet of Vehicles server, also referred to as a cloud server, cloud-side server, or cloud-side controller; it may be a single server or a server cluster formed by a plurality of servers, and is not specifically limited.
In one possible implementation manner, the voiceprint management apparatus provided by the present application includes an acquisition module and a processing module. The acquisition module is configured to acquire a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window, the first voiceprint is determined according to a first voice signal of a first user acquired in the first time window, the average similarity of the second time window indicates how similar the voiceprints of the first user acquired in the second time window are to the reference voiceprint, and the starting point of the first time window is later than the starting point of the second time window. The processing module is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
In one possible implementation, determining the verification result of the first voiceprint according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window includes: when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification passes.
In a possible implementation manner, the processing module is specifically configured to: and updating the reference voiceprint according to the ratio of the verified first voiceprints in the first time window.
In a possible implementation manner, the processing module is specifically configured to: updating the reference voiceprint when a ratio of verified first voiceprints in the first time window is less than or equal to a ratio threshold.
In a possible implementation manner, the processing module is specifically configured to: the control acquisition module acquires a second voice signal of the first user, wherein the difference between the similarity of the second voice signal and the reference voiceprint and the average similarity of the first time window is smaller than a difference threshold value, and the average similarity of the first time window is used for indicating the similarity between the voiceprint of the first user acquired in the first time window and the reference voiceprint; the reference voiceprint is updated based on the second speech signal.
In a possible implementation manner, the processing module is specifically configured to: obtaining a duplication-removing voice signal according to the second voice signal and a text corresponding to the second voice signal, wherein the text corresponding to the duplication-removing voice signal is a non-repetitive text; and updating the reference voiceprint according to the duplicate removal voice signal.
In a possible implementation manner, there is one second voice signal, and the processing module is specifically configured to: perform a deduplication operation on the text corresponding to the second voice signal to obtain deduplicated text; and obtain the deduplicated voice signal from the portion of the second voice signal corresponding to the deduplicated text.
In a possible implementation manner, there are a plurality of second voice signals, and the processing module is specifically configured to: for the i-th second voice signal among the plurality of second voice signals: perform a deduplication operation on the text corresponding to the i-th second voice signal according to the historical deduplicated text; and splice the deduplicated text of the i-th second voice signal with the historical deduplicated text to obtain the deduplicated text jointly corresponding to the 1st to i-th second voice signals, where the historical deduplicated text is obtained from the texts corresponding to the 1st to (i-1)-th second voice signals and i is greater than 1; and obtain the deduplicated voice signal from the portions of the second voice signals corresponding to the jointly deduplicated text.
In one possible implementation, the duration of the deduplicated speech signal is greater than a duration threshold, and/or the number of words in the non-repeated text corresponding to the deduplicated speech signal is greater than a word count threshold.
In one possible implementation, the processing module is further configured to: and sliding the first time window, wherein the end point of the first time window after sliding is later than the end point of the first time window before sliding.
In one possible implementation, the length of the first time window is variable.
In one possible implementation, the processing module is further configured to: and sliding the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
In a possible implementation manner, before the obtaining module obtains the verification result of the first voiceprint in the first time window, the processing module is further configured to: determining the similarity in the second time window according to the reference voiceprint and the second voiceprint of the first user acquired in the second time window; and determining the average similarity of the second time window according to the plurality of similarities in the second time window.
In a third aspect, the present application provides a computer program product comprising a computer program or instructions for implementing the method of the first aspect or any one of the possible implementations of the first aspect when the computer program or instructions are executed by a computing device.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program or instructions for implementing the method of the first aspect or any of its possible implementations when executed by a computing device.
In a fifth aspect, the present application provides a computing device comprising a processor connected to a memory, the memory being configured to store a computer program, and the processor being configured to execute the computer program stored in the memory, so as to enable the computing device to perform the method of the first aspect or any of the possible implementation manners of the first aspect.
In a sixth aspect, an embodiment of the present application provides a chip system, including: a processor coupled to the memory, and a memory for storing a program or instructions which, when executed by the processor, cause the system-on-chip to implement the method of the first aspect or any of the possible implementations of the first aspect.
Optionally, the chip system further includes an interface circuit, and the interface circuit is configured to transmit code instructions to the processor.
Optionally, the number of processors in the chip system may be one or more, and the processors may be implemented by hardware or software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory.
Optionally, there may be one or more memories in the chip system. The memory may be integrated with the processor or may be separate from the processor. Illustratively, the memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated on the same chip as the processor or may be separately provided on a different chip.
It should be understood that, in the technical solutions of the first aspect to the sixth aspect, the reference voiceprint of the first user is predetermined; the second voiceprints of the first user are then obtained in the second time window, the similarity between each second voiceprint and the reference voiceprint is determined, and the average similarity of the second time window is determined from the multiple similarities in the second time window. Since the second time window lies after the time point at which the reference voiceprint was stored (that is, the user's voiceprint remains steady during the period from registration of the reference voiceprint to the end point of the second time window), the reference parameters (namely, the reference voiceprint and the average similarity of the second time window) are obtained from the first user's voiceprint while it is stable. When subsequently determining whether the user's voiceprint has undergone a long-term change, the similarity between each first voiceprint of the user in the first time window and the reference voiceprint is determined and compared with the average similarity of the second time window to obtain the verification result of the first voiceprint; that is, it is determined whether the first voiceprints in the first time window are similar to the second voiceprints in the second time window, and further whether the user's voiceprint has undergone a long-term change in the period from the starting point of the second time window to the end point of the first time window.
Moreover, the verification results of the plurality of first voiceprints in the first time window are determined, the ratio of verified first voiceprints in the first time window (namely, the pass rate in the first time window) is determined, and whether the user's voiceprint has changed long term is determined according to the pass rate in the first time window, which avoids misjudgment caused by accidental factors in the first time window and improves judgment accuracy.
Furthermore, when the reference voiceprint is not updated, the first time window can be slid, and whether the user's voiceprint has undergone a long-term change is determined according to the rate at which the first voiceprints pass verification in the slid first time window (namely the standard-reaching rate of the slid first time window). In this way, a long-term change in the first user's voiceprint can be discovered in time, and the registered voiceprint of the first user can be updated accordingly.
Furthermore, when the reference voiceprint is updated, the second voice signal is obtained and the voiceprint of the first user is determined from it, so no interaction with the user is needed; the update is imperceptible to the user, which improves user experience. The multiple first voiceprints in the first time window do not need to be stored in advance, which reduces the storage requirement. Likewise, in this application, the multiple second voiceprints in the second time window do not need to be stored; only the average similarity of the second time window needs to be stored, which further reduces the storage requirement.
Drawings
Fig. 1a is a schematic structural diagram of a speech signal processing system exemplarily provided in the present application;
Fig. 1b is a schematic flow chart of a semantic understanding process exemplarily provided in the present application;
Fig. 1c is a schematic flow chart of a voiceprint extraction process exemplarily provided in the present application;
Fig. 2 is a schematic view of an in-vehicle scenario exemplarily provided in the present application;
Fig. 3 is a schematic view of another in-vehicle scenario exemplarily provided in the present application;
Fig. 4 is a schematic display diagram of a mobile phone interface exemplarily provided in the present application;
Fig. 5 is a schematic diagram of the correspondence between a voiceprint management flow and time exemplarily provided in the present application;
Fig. 6 is a schematic flow chart of voiceprint verification exemplarily provided in the present application;
Fig. 7 is a schematic diagram of the correspondence between another voiceprint management flow and time exemplarily provided in the present application;
Fig. 8 is a schematic flow chart of updating a reference voiceprint exemplarily provided in the present application;
Fig. 9 is a schematic view of yet another in-vehicle scenario exemplarily provided in the present application;
Fig. 10 is a schematic flow chart of another voiceprint management method exemplarily provided in the present application;
Fig. 11 is a schematic structural diagram of a voiceprint management apparatus exemplarily provided in the present application;
Fig. 12 is a schematic structural diagram of another voiceprint management apparatus exemplarily provided in the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1a is a schematic structural diagram of a speech signal processing system exemplarily provided by the present application, where the speech signal processing system may include at least a speech acquisition device, a voiceprint management device, and a semantic understanding device.
For example, the voice signal processing system may collect a voice signal of a user through the voice collecting device, and input the voice signal into the semantic understanding device and the voiceprint management device, respectively, where the semantic understanding device may be configured to perform a semantic extraction process on the voice signal to obtain a machine-recognizable instruction corresponding to the voice signal of the user; the voiceprint management device can determine the voiceprint feature vector of the user according to the voice signal and execute a corresponding recognition process based on the voiceprint feature vector.
The speech acquisition device may be a microphone array composed of a number of acoustic sensors, typically microphones. The microphone array may provide one or more of the following functions: speech enhancement, that is, extracting clean speech from a speech signal containing noise; sound source localization, that is, using the microphone array to calculate the angle and distance of a target speaker so as to track the speaker and subsequently pick up speech directionally; dereverberation, reducing the influence of reflected sound; and sound source signal extraction/separation, that is, extracting the individual sources from a mixture of sounds. The microphone array is therefore suitable for complex environments with heavy noise, reverberation, and echo, such as vehicles, outdoor settings, supermarkets, and the like.
Referring to a flow chart of a semantic understanding process exemplarily shown in fig. 1b, the semantic understanding device may sequentially perform the following processes on the speech signal:
(1) Automatic Speech Recognition (ASR): the speech signal input by the user is converted into natural language text. In one possible approach, the speech signal is processed as a sound wave: the signal is divided into frames, each about 20 ms to 30 ms long, yielding a short segment of waveform per frame. Each frame's waveform is converted into a multi-dimensional vector according to the characteristics of human hearing. From these multi-dimensional vectors, the corresponding phonemes (phones) are obtained by decoding; the phonemes are combined into words, and the words are concatenated into sentences (namely natural language text) for output.
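The framing step described above can be sketched as follows. The 16 kHz sample rate and the 25 ms frame with a 10 ms hop are illustrative assumptions; the application only states that a frame is about 20 ms to 30 ms long.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D waveform into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
frames = frame_signal(np.sin(2 * np.pi * 440 * t))
print(frames.shape)  # (98, 400): 98 frames of 400 samples (25 ms) each
```

Each row of the result would then be converted into the multi-dimensional vector that the decoder consumes.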
(2) Natural Language Processing (NLP): meaningful parts in natural language text are converted into structured information that can be understood by a machine.
(3) Semantic slot filling: the structured information obtained by natural language processing is filled into the corresponding slots, so that the user's intention is converted into a user instruction that the machine can recognize.
Referring to fig. 1c, a schematic flow diagram of a voiceprint extraction process, the voiceprint management device may include a voiceprint extraction module, which may be configured to perform voiceprint extraction on a voice signal to obtain the corresponding voiceprint feature vector. For example, the voiceprint extraction module may sequentially perform preprocessing, voiceprint feature extraction, and post-processing on the voice signal.
(1) Preprocessing: audio feature information is extracted from the voice signal. For example, during the extraction of the audio feature information, one or more of the following operations may be performed on the speech signal: denoising, voice activity detection (VAD), perceptual linear prediction (PLP), and mel-frequency cepstral coefficient (MFCC) calculation.
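As a loose sketch in the spirit of the list above, the following shows pre-emphasis and a toy energy-based voice activity detector; the filter coefficient and energy threshold are illustrative assumptions, and a production VAD is far more elaborate.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """High-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def energy_vad(frames, threshold_ratio=0.1):
    """Toy VAD: keep frames whose energy exceeds a fraction of the peak frame energy."""
    energies = np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)
    return frames[energies > threshold_ratio * energies.max()]

rng = np.random.default_rng(0)
speech = pre_emphasis(rng.standard_normal(400))  # one noisy "voiced" frame
silence = np.zeros(400)
frames = np.stack([silence, speech, silence])
voiced = energy_vad(frames)
print(len(voiced))  # 1: only the high-energy frame is kept
```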
(2) Voiceprint feature extraction: the audio feature information is passed to a voiceprint feature extraction model, which outputs a voiceprint feature vector. Voiceprint feature extraction models include, but are not limited to, one or more of: a Gaussian mixture model (GMM), a joint factor analysis (JFA) model, an i-vector model, a d-vector model, and an x-vector model. In this application, the voiceprint feature vector may be referred to as a voiceprint for short.
(3) Post-processing: the voiceprint output by the voiceprint feature extraction model is post-processed to obtain the final voiceprint. Illustratively, post-processing may include one or more of: linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), and nuisance attribute projection (NAP).
As described above, the voiceprint extraction module can extract a corresponding voiceprint from the voice signal. Like a human face, a fingerprint, or an iris, a voiceprint is a form of biometric information, and the identity of the user who produced the voice signal can be determined from it. Compared with face recognition, identifying a user by voiceprint is not limited by occlusion of the user's face; compared with fingerprint recognition, it requires no physical contact, so it is highly practicable.
The voiceprint management device may further include a voiceprint recognition module, which may be configured to recognize the identity of the user according to the voiceprint. In one possible implementation, the voiceprint recognition module stores a reference voiceprint of the user in advance, and compares the voiceprint from the voiceprint extraction module (which may be called the collected voiceprint) with the reference voiceprint to determine whether the two correspond to the same user.
In this application, the voiceprint recognition module may determine whether the collected voiceprint and the reference voiceprint correspond to the same user according to a similarity threshold. The module may determine the similarity threshold from a sample of N users, where N is a positive integer. In an optional manner, for each of the N users, the voiceprint recognition module obtains the similarity between that user's registered voiceprint and collected voiceprint, yielding N similarities, and then determines the similarity threshold from them. For example, the average or the median of the N similarities may be used as the similarity threshold. The similarity threshold may be, for example, 0.75.
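The threshold derivation just described can be sketched as follows; the sample similarity values are invented purely for illustration.

```python
import numpy as np

def similarity_threshold(sample_similarities, method="mean"):
    """Derive a decision threshold from N registered-vs-collected similarities."""
    sims = np.asarray(sample_similarities, dtype=float)
    if method == "mean":
        return float(sims.mean())
    if method == "median":
        return float(np.median(sims))
    raise ValueError(f"unknown method: {method}")

# similarity between registered and collected voiceprint for each of N = 5 sample users
samples = [0.72, 0.78, 0.75, 0.71, 0.79]
print(round(similarity_threshold(samples), 2))            # 0.75
print(round(similarity_threshold(samples, "median"), 2))  # 0.75
```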
In a first example, if the voiceprint recognition module determines that the similarity between the collected voiceprint and the reference voiceprint is greater than the similarity threshold, it may determine that the two correspond to the same user; if the similarity is less than or equal to the threshold, it may determine that they correspond to different users. In this application, a similarity greater than the threshold can accordingly be understood as the collected voiceprint matching the reference voiceprint, and a similarity less than or equal to the threshold as the two not matching.

In a second example, the voiceprint recognition module may instead determine that the collected voiceprint and the reference voiceprint correspond to the same user when the similarity between them is greater than or equal to the similarity threshold, and to different users when the similarity is less than the threshold. Correspondingly, in this example a similarity greater than or equal to the threshold is understood as the collected voiceprint matching the reference voiceprint, and a similarity less than the threshold as the two not matching.
For ease of description, the first example is used in the following.
It should be added that, when the similarity between the collected voiceprint and each of multiple pre-registered reference voiceprints is greater than the similarity threshold, the reference voiceprint with the greatest similarity to the collected voiceprint may be taken as the matching voiceprint of the collected voiceprint; that is, only the reference voiceprint with the greatest similarity is determined to match the collected voiceprint, and the other reference voiceprints are determined not to match it.
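A minimal sketch of this best-match rule, using cosine similarity as the similarity algorithm; the toy voiceprint vectors and the 0.75 threshold are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(collected, references, threshold=0.75):
    """Return the key of the single best-matching reference voiceprint.

    Only the reference with the highest similarity counts as a match;
    all other references are treated as non-matching even if they
    also exceed the threshold.
    """
    sims = {name: cosine_similarity(collected, ref)
            for name, ref in references.items()}
    name, sim = max(sims.items(), key=lambda kv: kv[1])
    return name if sim > threshold else None

refs = {"user_a": [1.0, 0.0, 0.2], "user_b": [0.0, 1.0, 0.2]}
print(best_match([0.9, 0.1, 0.2], refs))  # user_a
```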
Voiceprint recognition can be used for identification. For example, reference voiceprints of one or more users are stored in the voiceprint recognition module; the module compares a collected voiceprint with the stored reference voiceprints, determines which reference voiceprint matches the currently collected voiceprint, and then determines the user's identity from that reference voiceprint. Furthermore, the different permissions of different users may be stored in the module, so that after determining a user's identity from the collected voiceprint, the module can further determine that user's permissions from the identity.
Voiceprint recognition can also be used for identity verification. For example, reference voiceprints of one or more authenticated users are stored in the voiceprint recognition module; the module compares the collected voiceprint with the stored reference voiceprints and determines whether any of them matches. If so, the user corresponding to the currently collected voiceprint passes verification; otherwise, that user fails verification.
To aid understanding, the voiceprint recognition flow in this application is explained below in conjunction with usage scenarios. For example, the voiceprint management device in a usage scenario may be an in-vehicle device (e.g., a head unit or an in-vehicle computer), and the in-vehicle device may determine, based on reference voiceprints, the identity of the user corresponding to the currently collected voiceprint.
In one example, the in-vehicle device may determine whether a user passes verification based on a reference voiceprint. Specifically, suppose the in-vehicle device provides the "vehicle violation information" query function to the vehicle owner only. The in-vehicle device may store the reference voiceprint a of user A (the owner) and mark it as corresponding to the owner. Referring to the scene shown in fig. 2 (a), when user A wants to query vehicle violation information, user A may say "query the vehicle violation information". The in-vehicle device extracts the voiceprint from this voice signal, determines that the extracted voiceprint matches reference voiceprint a, and, in response, displays the current violation information on the display screen, for example "April 20, 2021: ran a red light, 6 points deducted". Referring to the scene shown in fig. 2 (b), when user B (not the owner) queries vehicle violation information, the in-vehicle device extracts the voiceprint from user B's voice signal "query the vehicle violation information", determines that the extracted voiceprint does not match reference voiceprint a, and prompts user B that the query failed, for example by displaying "query limited to owner" in the display interface.
The above example can also be understood as the in-vehicle device determining, after identifying the user, the different permissions corresponding to different users. Specifically, the owner has permission to query the vehicle's violation information, while a non-owner does not. When the in-vehicle device determines that the user is the owner, it provides the violation query function; when it determines that the user is not the owner, it refuses to provide that function. In such permission-based examples, a reference voiceprint corresponding to the driver may also be set in the in-vehicle device: when the collected user voiceprint matches the driver's reference voiceprint, the device determines that the current user is the driver and grants the driver's permissions, for example controlling the vehicle through voice signals.
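The permission logic in this scenario can be sketched as follows; the permission table, identity names, and action names are hypothetical and only illustrate mapping a matched identity to its permissions.

```python
# hypothetical permission table: which identities may perform which actions
PERMISSIONS = {
    "owner": {"query_violation_info", "control_driving"},
    "driver": {"control_driving"},
}

def authorize(matched_identity, action):
    """Allow the action only if the identity behind the matched reference
    voiceprint carries the corresponding permission; no voiceprint match
    means no permissions at all."""
    if matched_identity is None:
        return False
    return action in PERMISSIONS.get(matched_identity, set())

print(authorize("owner", "query_violation_info"))   # True
print(authorize("driver", "query_violation_info"))  # False: owner-only function
print(authorize(None, "query_violation_info"))      # False: no voiceprint match
```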
In yet another example, the in-vehicle device provides different recommended content for different users. Specifically, the vehicle-mounted equipment can store the reference voiceprint a of the user A and record that the user A likes rock and roll music; and storing the reference voiceprint B of the user B and recording that the user B likes the light music. Referring to the scenario shown in fig. 3 (a), when the user a says "open music", the in-vehicle device may extract the voiceprint in the speech signal "open music", compare the extracted voiceprint with the stored reference voiceprint a and reference voiceprint b, respectively, determine that the extracted voiceprint matches with the reference voiceprint a, and recommend a rock-and-roll music list in the display interface. Referring to the scenario shown in fig. 3 (B), when the user B says "open music", the vehicle-mounted device may extract a voiceprint in the speech signal "open music", compare the extracted voiceprint with the stored reference voiceprint a and reference voiceprint B, respectively, determine that the extracted voiceprint matches with the reference voiceprint B, and recommend a list of light music in the display interface.
Of course, in this example, the in-vehicle device may also directly play rock music when it determines that the extracted voiceprint matches reference voiceprint a, or directly play light music when the extracted voiceprint matches reference voiceprint b. Other implementations are also possible; this application is not limited in this respect.
In combination with the above determination of different permissions for different users, another usage scenario is exemplarily provided, in which the voiceprint management device may be a user terminal, for example a mobile phone. The mobile phone stores the reference voiceprint of its owner user in advance, can determine whether a collected voiceprint matches that reference voiceprint, and can thereby determine whether the user corresponding to the currently collected voiceprint is the owner user. When the mobile phone is in the screen-locked state, the owner user can instruct it by voice to unlock and perform a corresponding action. For example, in the scenario shown in fig. 4, when the owner user wants to use the album application, the user may say "open the album application" to the locked phone, whose display interface may be as shown in fig. 4 (a). The phone obtains the voice signal "open the album application" and extracts the voiceprint; when it determines that the extracted voiceprint matches the reference voiceprint, it opens the album application in response to the voice signal, and its display interface may be as shown in fig. 4 (b).
In the above scenarios, the voiceprint management device needs to compare the collected voiceprint against an accurate reference voiceprint and then execute the corresponding action based on the comparison result (match or mismatch). That is, the accuracy of the reference voiceprint stored in the voiceprint management device affects the accuracy of voiceprint recognition.
A user's voiceprint may undergo a short-term change or a long-term change. A short-term change is a reversible voiceprint change caused by a temporary external stimulus, for example a cold. A long-term (permanent) change is an irreversible voiceprint change caused by a physiological change in the user. The voiceprint management device needs to update the user's reference voiceprint based on a voiceprint that has changed long-term, but should not update it based on a voiceprint that has changed only short-term; this improves the accuracy of the reference voiceprint and hence of voiceprint recognition.
Therefore, it is necessary to accurately distinguish whether the user's voiceprint has changed and whether the change is short-term or long-term, and then to update the user's reference voiceprint based on a long-term changed voiceprint.
The present application illustratively provides a voiceprint management method, which can be performed by a voiceprint management apparatus. Illustratively, the voiceprint management apparatus may be the voiceprint management device exemplarily shown in fig. 1 a.
Illustratively, the voiceprint management apparatus may be a terminal device, or a component (such as a processing apparatus, a circuit, a chip, etc.) in the terminal device. The terminal device includes, for example, a vehicle-mounted device (e.g., a vehicle-mounted device, a vehicle-mounted computer, etc.), and a user terminal (e.g., a mobile phone, a computer, etc.).
As another example, the voiceprint management apparatus can be a server or a component (e.g., a processing apparatus, a circuit, a chip, etc.) in a server. The server may include a physical device such as a host or a processor, a virtual device such as a virtual machine or a container, and a chip or an integrated circuit. The server is, for example, a car networking server, also called a cloud server, a cloud end server, or a cloud end controller, and the car networking server may be a single server, or may be a server cluster formed by a plurality of servers, and is not particularly limited.
According to the voiceprint management method provided in this application, whether the voiceprint collected in the current time window passes verification can be determined based on the reference voiceprint and the reference similarity, and it can then be determined whether the user's reference voiceprint needs to be updated.
To better explain the embodiments of the present application, the three processes of the voiceprint management method, namely acquiring the reference voiceprint, acquiring the reference similarity, and performing voiceprint verification, are explained in turn below in conjunction with the correspondence between the voiceprint management flow and time exemplarily shown in fig. 5.
It should be stated in advance that the following three processes are all directed at the same user (also called the registrant or speaker): the user's reference voiceprint and reference similarity are obtained, the user's voiceprint is verified, and it is determined whether that user's reference voiceprint needs to be updated. In a specific implementation, the users involved may be determined to be the same user based on technologies such as voiceprint comparison and face recognition.
In voiceprint comparison, the similarity threshold can indicate whether two voiceprints correspond to the same user. It can be understood that, regardless of whether a user's voiceprint has changed, the similarity between the user's collected voiceprint and that user's reference voiceprint remains greater than the similarity threshold, while the similarity between voiceprints of different users is less than the threshold. For example, suppose user A registers reference voiceprint a, user B registers reference voiceprint b, and the similarity between a and b is less than the threshold. When user A speaks, the voiceprint determined from the collected voice signal has a similarity to reference voiceprint a that is greater than the threshold and a similarity to reference voiceprint b that is less than the threshold, so the user corresponding to the collected voiceprint is determined to be user A rather than user B.
In face recognition, once it is determined that a particular user is speaking, the collected voice signal is taken to be that user's voice signal, and the voiceprint extracted from it is that user's voiceprint. Whether the user is speaking may be determined from the user's mouth shape: if the user's mouth opens and/or closes according to a preset rule and a voice signal corresponding to that rule is acquired, it can be determined that the current user is speaking and that the acquired voice signal was issued by that user.
In the present application, the user may also be identified in other ways. For example, a user account may be set up: when a user logs in through the account and issues a voice signal, it may be determined that the acquired voice signal comes from the same user as the one logged in to the account. The account may also be replaced by a user fingerprint: when the user logs in by fingerprint and issues a voice signal, it may be determined that the acquired voice signal comes from the same user as the one logged in by fingerprint.
In the application, one or more of voiceprint comparison, face recognition and account verification can be combined to determine the same user, so that the accuracy of determining the same user is improved.
For convenience of description, in the following embodiments, the same user targeted is referred to as the first user.
1. Reference voiceprint acquisition process:
The first user may register a voiceprint at a registration time point (t0 as shown in fig. 5 (a)); the voiceprint registered by the first user may be referred to as the registered voiceprint or the reference voiceprint. In one possible implementation, a passage of preset text is displayed in the display interface; the first user reads it aloud, the first user's current voice signal is obtained, a voiceprint is extracted from that voice signal to obtain the first user's reference voiceprint, and the reference voiceprint is stored.
2. Reference similarity acquisition process:
Because the reference voiceprint is extracted from a single segment of the first user's speech, the reference similarity over a second preset duration after the registration time point may additionally be obtained, in order to avoid the reference voiceprint being inaccurate for accidental reasons. The reference similarity and the reference voiceprint together serve as the reference parameters for determining whether the reference voiceprint needs to be updated, as described in the voiceprint verification process below.
The period after the registration time point whose length is the second preset duration may also be referred to as the reference period, the reference time window, the second time window, and so on. The second time window can perform a sliding operation under specific conditions, so its actual duration is variable.
In the specific acquisition process, a voice signal of the first user within the second time window (t0-t1 shown in fig. 5 (a)) may be acquired, and a voiceprint extracted from it, namely a voiceprint of the first user acquired in the second time window (which may be called a second voiceprint). The second voiceprint can be compared with the reference voiceprint; specifically, the similarity between the second voiceprint and the reference voiceprint can be determined using a similarity algorithm (such as cosine similarity or a Euclidean distance based measure).
Further, multiple second voiceprints may be collected in the second time window, and the similarity between each of them and the reference voiceprint determined. In one example, the average of these similarities may be used as the reference similarity of the second time window (which may be called the average similarity). Of course, in the present application, the median of the similarities, or the average of the maximum and the minimum, may also be used as the reference similarity of the second time window, among other approaches. In the following, the average similarity of the second time window is taken as the example; "average similarity" may be substituted for "reference similarity" to indicate the same meaning.
Here, to improve the accuracy of the average similarity of the second time window, a threshold on the number of second voiceprints may be preset. When the number of second voiceprints collected within the second preset duration of the second time window is smaller than this threshold, the duration of the second time window may be automatically extended until the number of second voiceprints in the window reaches the threshold.
Illustratively, suppose the threshold is 10 and only 8 second voiceprints are collected within t0-t1 shown in fig. 5 (a). To improve the accuracy of the average similarity of the second time window, the duration of the window may be automatically extended; if, for example, the 10th second voiceprint is not acquired until t1', the end point of the second time window may be set to t1', that is, the second time window becomes t0-t1'. The average similarity of the second time window can then be determined from the similarities of the 10 second voiceprints to the reference voiceprint.
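The extension rule can be sketched as a small helper that keeps the second time window open until enough second voiceprints have been collected; the count threshold of 3 and the similarity values are illustrative, not values from this application.

```python
class SecondTimeWindow:
    """Collect second-voiceprint similarities during the second time window.

    If fewer than `count_threshold` second voiceprints have been collected
    when the preset duration ends, the window is considered incomplete and
    should be extended; once the threshold is reached, the average
    similarity becomes available.
    """

    def __init__(self, count_threshold=10):
        self.count_threshold = count_threshold
        self.similarities = []

    def add(self, similarity):
        self.similarities.append(similarity)

    def complete(self):
        """True once enough second voiceprints have been collected."""
        return len(self.similarities) >= self.count_threshold

    def average_similarity(self):
        if not self.complete():
            raise RuntimeError("window must be extended: too few second voiceprints")
        return sum(self.similarities) / len(self.similarities)

window = SecondTimeWindow(count_threshold=3)
for sim in (0.80, 0.82):
    window.add(sim)
print(window.complete())  # False: keep extending the window
window.add(0.84)          # the 3rd second voiceprint finally arrives
print(round(window.average_similarity(), 2))  # 0.82
```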
3. Voiceprint verification process:
In the present application, a plurality of voiceprints of the first user may be obtained in a verification time period (which may be referred to as a first time window), and whether the voiceprint of the first user has changed over the long term may be determined according to the obtained voiceprints, that is, whether the reference voiceprint of the first user needs to be updated according to the changed voiceprint of the first user.
In the present application, the starting point of the first time window is later than the starting point of the second time window, and the first time window may partially overlap the second time window or not overlap it at all. The first time window may be set to a first preset duration, which may be equal to or different from the second preset duration. Further, the first time window may perform a sliding operation in certain situations, and the actual duration of the first time window may be variable.
Fig. 6 is a schematic flow chart of voiceprint verification, in which:
Step 601, a first voice signal in a first time window is obtained.
A first speech signal of the first user in the first time window (t1-t2 as shown in fig. 5 (a)) may be acquired, and a first voiceprint of the first user may be extracted from the first speech signal according to the voiceprint extraction procedure exemplarily shown in fig. 1c.
Step 602, extracting a first voiceprint from the first voice signal, and determining a verification result of the first voiceprint.
The first voiceprint may be compared with the reference voiceprint to determine a verification result of the first voiceprint. Specifically, the similarity between the first voiceprint and the reference voiceprint may be determined. When the similarity is greater than the average similarity corresponding to the second time window, the first voiceprint is recorded as passing verification; when the similarity is not greater than the average similarity corresponding to the second time window, the first voiceprint is recorded as failing verification.
In an optional manner, whether the corresponding first voiceprint passes verification may be recorded with a single bit; for example, when the first voiceprint passes verification, the value of the corresponding bit is recorded as 1, and when it fails verification, the value is recorded as 0. In this technical solution, only the verification results of the plurality of first voiceprints need to be stored rather than the voiceprints themselves, and each verification result occupies one bit, thereby helping to reduce storage space.
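The one-bit-per-result storage suggested above might look like the following sketch; the class name and the use of a Python int as the bit container are illustrative:

```python
class VerificationBits:
    """Store each first-voiceprint verification result in a single bit."""

    def __init__(self):
        self.bits = 0   # bit i holds the i-th verification result
        self.count = 0  # number of results recorded so far

    def record(self, passed):
        if passed:
            self.bits |= 1 << self.count
        self.count += 1

    def passed_count(self):
        return bin(self.bits).count("1")

    def pass_rate(self):
        return self.passed_count() / self.count if self.count else 0.0
```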
Step 603, whether the reference voiceprint needs to be updated is determined according to the verification results of the first voiceprints in the first time window. If no update is needed, the voiceprint verification process may be executed again; if an update is needed, the voiceprint update process may be executed.
Multiple first voiceprints may be collected in the first time window, that is, the verification results of multiple first voiceprints may be determined. Whether to update the reference voiceprint of the first user may then be determined based on the verification results of the plurality of first voiceprints.
In an alternative example, the proportion of first voiceprints that pass verification (which may be called the pass rate) may be counted, that is, the ratio of the number of verified first voiceprints to the total number of first voiceprints. If the ratio is greater than a ratio threshold, the voiceprint of the current first user has not changed over the long term, and the reference voiceprint of the first user does not need to be updated. If the ratio is less than or equal to the ratio threshold, the voiceprint of the current first user has changed over the long term, and the reference voiceprint of the first user needs to be updated.
For example, assume that the average similarity corresponding to the second time window is 0.85 and the ratio threshold is 70%. Suppose 5 first voiceprints are obtained in the first time window in total, and the similarities between the 5 first voiceprints and the reference voiceprint are determined to be 0.90, 0.95, 0.80, 0.86, and 0.88, respectively; the verification results of the 5 first voiceprints are then 1, 1, 0, 1, and 1. The proportion of first voiceprints passing verification in the first time window is therefore 80% (greater than the ratio threshold of 70%), and the reference voiceprint of the first user does not need to be updated.
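The pass-rate decision of this example can be sketched as follows. The concrete similarity values are illustrative, chosen so that 4 of the 5 first voiceprints exceed the average similarity of 0.85, matching the 80% ratio of the example:

```python
def needs_update(similarities, avg_similarity, ratio_threshold):
    """Return True when the reference voiceprint should be updated, i.e.
    when the proportion of verified first voiceprints is too low."""
    results = [1 if s > avg_similarity else 0 for s in similarities]
    pass_rate = sum(results) / len(results)
    return pass_rate <= ratio_threshold

sims = [0.90, 0.95, 0.80, 0.86, 0.88]   # illustrative similarities
decision = needs_update(sims, avg_similarity=0.85, ratio_threshold=0.70)
# pass rate is 80% > 70%, so no update is needed (decision is False)
```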
Here, in order to improve verification accuracy, a first count threshold may be preset. When the number of verification results of the first voiceprint determined in the first time window of the first preset duration is smaller than the first count threshold, the duration of the first time window may be automatically extended until the number of verification results of the first voiceprint in the first time window reaches the first count threshold. For example, if the first count threshold is 10 and only 8 first voiceprints are collected within t1-t2 as shown in fig. 5 (a), the duration of the first time window may be automatically extended; for example, if the 10th first voiceprint is not collected until t2', it may be determined that the termination point of the first time window is t2', that is, the first time window is t1-t2'. Further, whether to update the reference voiceprint may be determined based on the verification results of the 10 first voiceprints.
In this step, the average of the similarities between the respective first voiceprints in the first time window and the reference voiceprint (which may be referred to as the average similarity of the first time window) may further be determined. The average similarity of the first time window indicates the similarity between the voiceprints of the first user acquired in the first time window and the reference voiceprint, and may be used for the subsequent update of the reference voiceprint, as described in the following embodiments.
At step 604, the first time window is slid.
In an alternative, the first time window may be slid backward by a third preset duration, which may be shorter than the first preset duration and/or the second preset duration. The termination point of the first time window after sliding is later than the termination point of the first time window before sliding. For example, referring to (a) and (b) in fig. 5, the first time window slides from t1-t2 (before sliding) to t3-t4 (after sliding), wherein the interval between t2 and t4 is the third preset duration. In one specific implementation, the first preset duration may be 7 days, the second preset duration may be 7 days, and the third preset duration may be 1 day.
Since in some embodiments the first time window may be extended until the number of verification results of the first voiceprint in it reaches the first count threshold, that is, the duration of the first time window is variable, the duration of the first time window before sliding may be greater than the first preset duration. In this case, the termination point of the first time window may be slid backward by the third preset duration, and the starting point of the slid first time window may then be determined from the termination point of the slid first time window and the first preset duration.
For example, referring to fig. 7, suppose that in the time period from t1 to t2 (where the interval between t1 and t2 is the first preset duration), the number of verification results of the first voiceprint does not reach the first count threshold, so the termination point of the first time window is moved back to t2' (the interval between t1 and t2' is greater than the first preset duration). When the first time window is slid, the time point t2' + the third preset duration may be taken as the termination point of the slid first time window, denoted t4'. The starting point of the slid first time window is then t4' - the first preset duration, denoted t3'.
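The sliding computation described above, including the variable-duration case, can be sketched as below; the function name is illustrative and times are plain numbers:

```python
def slide_first_window(current_end, first_preset, third_preset):
    """Slide the first time window: push the termination point back by the
    third preset duration, then derive the new starting point from the first
    preset duration, so the slid window always spans the first preset
    duration even if the previous window had been extended."""
    new_end = current_end + third_preset   # e.g. t4' = t2' + third preset
    new_start = new_end - first_preset     # e.g. t3' = t4' - first preset
    return new_start, new_end
```

With a first preset duration of 7 days, a third preset duration of 1 day, and an extended termination point at day 9, the slid window covers days 3 to 10.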
Furthermore, first voiceprints of the first user may continue to be collected in the slid first time window; the similarity between each collected first voiceprint and the reference voiceprint is determined, and the verification result of that first voiceprint is then determined. The verification results of the plurality of first voiceprints determined in the slid first time window are used to decide whether the reference voiceprint of the first user needs to be updated. That is, after step 604, steps 601 to 603 may be repeated until the reference voiceprint of the first user is updated.
Illustratively, referring to (a) to (c) of fig. 5, the first time window may gradually slide from t1-t2 to t5-t6. Further, suppose that when the first time window has slid to t5-t6, it is determined, according to the verification results of the plurality of first voiceprints acquired in t5-t6, that the voiceprint update procedure needs to be executed; the voiceprint update procedure of steps 605 to 606 may then be started at t6.
Step 605, a second voice signal is obtained, wherein the difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is smaller than a difference threshold.
The voice signal of the first user may be collected after the voiceprint update procedure is started, and whether the voice signal meets a preset condition is then determined. If so, the voice signal is used to update the reference voiceprint; otherwise, the voice signal is discarded.
Illustratively, after the voiceprint update procedure is started, a voiceprint is determined from the collected voice signal, and the similarity between that voiceprint and the reference voiceprint is determined. If the difference between this similarity and the average similarity of the first time window is less than the difference threshold, the collected voice signal (i.e., the second voice signal) may be used to update the reference voiceprint; otherwise, the collected voice signal may be discarded. Illustratively, the difference threshold is 0.1.
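The filtering of collected voice signals can be sketched as a simple predicate; reading the "difference" as an absolute difference is an assumption made here:

```python
def keep_for_update(similarity, first_window_avg, diff_threshold=0.1):
    """Keep a collected voice signal as a second voice signal only if its
    similarity to the reference voiceprint is close to the average
    similarity of the first time window; otherwise discard it."""
    return abs(similarity - first_window_avg) < diff_threshold
```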
Step 606, updating the reference voiceprint according to the second voice signal.
In the present application, one or more second voice signals may be acquired after the voiceprint update procedure is started. A deduplication operation, or a deduplication operation and a splicing operation, may be performed on the one or more second voice signals to obtain the resulting speech signal, and the voiceprint may then be extracted from it.
In the present application, it is considered that when the first user utters a voice signal, high-frequency words, such as a wake-up word, may exist in the voice signal, and excessive occurrences of such high-frequency words may affect the accuracy of extracting the voiceprint from the voice signal. Therefore, a deduplication operation needs to be performed on the second speech signal. Specifically, ASR is performed on the second speech signal to obtain the text corresponding to it, and the deduplication operation is performed on that text to obtain text without repeated content (which may be referred to as deduplicated text). Further, based on the deduplicated text and the second speech signal, the speech signal after the deduplication operation (which may be referred to as the deduplicated speech signal) is obtained.
Furthermore, in order to improve the accuracy of the voiceprint update, an update condition may be preset, and when the deduplicated speech signal meets the update condition, the reference voiceprint may be updated according to the deduplicated speech signal. For example, the update condition may be that the duration of the deduplicated speech signal is greater than a duration threshold, and/or that the number of words in the non-repeated text (i.e., the deduplicated text) corresponding to the deduplicated speech signal is greater than a word number threshold. For convenience of description, in the following examples the update condition is that the duration of the deduplicated speech signal is greater than the duration threshold and the number of words in its non-repeated text is greater than the word number threshold.
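The combined update condition used in the following examples can be sketched as a predicate; the default thresholds are the example values from the text and the names are illustrative:

```python
def meets_update_condition(dedup_duration_s, dedup_word_count,
                           duration_threshold=5.0, word_threshold=12):
    """True when the deduplicated speech signal is long enough AND its
    deduplicated text contains enough words."""
    return (dedup_duration_s > duration_threshold
            and dedup_word_count > word_threshold)
```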
For example, as shown in fig. 5 (d), the process of updating the reference voiceprint is started at t6; one or more second voice signals are then collected until, at t7, a deduplicated speech signal meeting the update condition is obtained, and the reference voiceprint is updated according to that deduplicated speech signal at t7. t6-t7 may be understood as a third time window, i.e., the window for acquiring the one or more second speech signals.
The above steps 605 and 606 are explained below with reference to fig. 8, which exemplarily provides a flowchart of a method for updating the reference voiceprint.
Step 801, the 1st second voice signal is obtained.
Step 802, the deduplicated speech signal corresponding to the 1st second voice signal is determined according to the 1st second voice signal.
Specifically, ASR is performed on the 1st second voice signal to obtain the text corresponding to it, and a deduplication operation is performed on that text to obtain the deduplicated text corresponding to the 1st second voice signal (which may be referred to as the first deduplicated text).
For example, if the first user wants to wake up a device whose wake-up word is "small A small A", the first user may issue a second speech signal "small A small A". After the second speech signal is collected, ASR processing is performed on it to obtain the corresponding text, i.e., "small A small A". Further, since the repeated text "small A" is included in "small A small A", a deduplication operation may be performed, and the obtained first deduplicated text is "small A". Similarly, if the first user issues the second voice signal "small A small A, please turn on the air conditioner", the first deduplicated text obtained from that second voice signal is "small A please turn on the air conditioner".
Of course, there may be no repeated text in the text corresponding to the second voice signal; for example, the first user issues the second voice signal "hello, small A", or "small A, please turn on the air conditioner", or "small A, please play music", etc. In this case, after ASR is performed on the second speech signal and it is determined that there is no repeated text in the corresponding text, the text corresponding to the second speech signal is directly used as the first deduplicated text.
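The text deduplication step can be sketched as below, mirroring the wake-word example; treating the ASR output as a list of already-segmented text units is an assumption of this sketch:

```python
def dedup_tokens(tokens):
    """Remove repeated text units, keeping the first occurrence of each,
    e.g. ["small A", "small A"] -> ["small A"]."""
    seen, out = set(), []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out
```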
Further, the speech signal corresponding to the first deduplicated text in the 1st second speech signal may be determined, so as to obtain the speech signal resulting from the deduplication operation on the 1st second speech signal (i.e., the first deduplicated speech signal).
For example, if the 1st second voice signal is "small A small A, please turn on the air conditioner", the correspondence between "small A small A, please turn on the air conditioner" and the speech signal segments in the 1st second voice signal is as shown in table 1.
TABLE 1
Text                               | Speech signal segment
Small A (first)                    | Speech signal segment 11
Small A (second)                   | Speech signal segment 12
Please turn on the air conditioner | Speech signal segment 13
After determining that the first deduplicated text is "small A please turn on the air conditioner", the first deduplicated speech signal may be determined according to the speech signal segments respectively corresponding to the first deduplicated text in the 1st second speech signal. With reference to table 1, the speech signal segment 11 corresponding to "small A (first)" and the speech signal segment 13 corresponding to "please turn on the air conditioner" are spliced to obtain the first deduplicated speech signal.
In this example, since the speech signal segment corresponding to "small A" in the first deduplicated text may be either speech signal segment 11 or speech signal segment 12 in the 1st second speech signal, during splicing either speech signal segment 11 and speech signal segment 13, or speech signal segment 12 and speech signal segment 13, may be spliced to obtain the first deduplicated speech signal.
It should be noted that the above is only an example of how to obtain the deduplicated speech signal from the second speech signal according to the deduplicated text; the second speech signal may also be divided according to the text as shown in table 2. Of course, the second speech signal may be divided in other manners, which the present application does not limit.
TABLE 2
Text             | Speech signal segment
Small A (first)  | Speech signal segment 11
Small A (second) | Speech signal segment 12
Please turn on   | Speech signal segment 14
Air conditioner  | Speech signal segment 15
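The splicing step, which picks one speech signal segment per deduplicated text unit (as in Table 1 or Table 2), can be sketched as follows; representing the segments as a mapping from text to a list of candidate segments is an assumption of this sketch:

```python
def splice(segments_by_text, dedup_text_units):
    """Pick the first available speech signal segment for each unit of the
    deduplicated text and concatenate them into the deduplicated speech
    signal (e.g. segment 11 + segment 13 in the Table 1 example)."""
    return [segments_by_text[unit][0] for unit in dedup_text_units]
```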
Step 803, if it is determined that the first deduplicated speech signal meets the update condition, step 807 is executed; if it is determined that the first deduplicated speech signal does not meet the update condition, step 804 is executed.
Step 804, the 2nd second voice signal is obtained.
Step 805, the deduplicated speech signal jointly corresponding to the 1st and 2nd second voice signals is determined according to the 2nd second voice signal and the first deduplicated text.
It should be noted that, before the 2nd second voice signal is obtained, the 1st second voice signal has already been obtained and processed to obtain the first deduplicated text, which may serve as the historical deduplicated text for the 2nd second voice signal. After ASR processing is performed on the 2nd second speech signal to obtain its corresponding text, a deduplication operation is performed on that text according to the historical deduplicated text (the first deduplicated text), obtaining the deduplicated text jointly corresponding to the 1st and 2nd second speech signals (which may be called the second deduplicated text).
In an alternative manner, a deduplication operation is first performed on the text corresponding to the 2nd second voice signal; the result is then compared with the first deduplicated text so that text repeated between the two is deleted, and the remaining text is spliced onto the first deduplicated text to obtain the second deduplicated text.
Illustratively, suppose the first deduplicated text is "small A please turn on the air conditioner". The 2nd second voice signal is "small A small A, air conditioner on high point"; ASR is performed on it to obtain the text "small A small A air conditioner on high point", and after the deduplication operation the text "small A air conditioner on high point" is obtained. Then, according to the first deduplicated text "small A please turn on the air conditioner" and "small A air conditioner on high point", the second deduplicated text "small A please turn on the air conditioner high point" is obtained.
In another alternative, the text corresponding to the 2nd second speech signal may be spliced onto the first deduplicated text, and the deduplication operation may then be performed on the spliced text to obtain the second deduplicated text.
Illustratively, suppose again that the first deduplicated text is "small A please turn on the air conditioner". The 2nd second speech signal is "small A small A, air conditioner on high point", and ASR yields the text "small A small A air conditioner on high point". Splicing "small A small A air conditioner on high point" onto the first deduplicated text "small A please turn on the air conditioner" gives "small A please turn on the air conditioner small A small A air conditioner on high point", and the deduplication operation is then performed to obtain the second deduplicated text "small A please turn on the air conditioner high point".
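The second alternative, splicing the new text onto the historical deduplicated text and then deduplicating the whole, can be sketched as below (text is again modeled as pre-segmented units, an assumption of this sketch):

```python
def incremental_dedup(history_units, new_units):
    """Splice the new text units onto the historical deduplicated text,
    then remove repetitions, keeping first occurrences."""
    seen, out = set(), []
    for unit in history_units + new_units:
        if unit not in seen:
            seen.add(unit)
            out.append(unit)
    return out
```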
Of course, other modes of deduplication operations and splicing operations are also possible, and are not listed in this application.
After the second deduplicated text is obtained, the speech signal corresponding to the second deduplicated text in the 1st second voice signal and that in the 2nd second voice signal may be determined, so as to obtain the speech signal resulting from the deduplication and splicing operations on the 1st and 2nd second voice signals (i.e., the second deduplicated speech signal).
For example, the 1st second voice signal is "small A small A, please turn on the air conditioner", and the 2nd second voice signal is "small A small A, air conditioner on high point". Suppose the correspondence between "small A small A, please turn on the air conditioner" and the speech signal segments in the 1st second speech signal is as shown in table 1, and the correspondence between "small A small A, air conditioner on high point" and the speech signal segments in the 2nd second voice signal is as shown in table 3.
TABLE 3
Text               | Speech signal segment
Small A (first)    | Speech signal segment 21
Small A (second)   | Speech signal segment 22
Air conditioner on | Speech signal segment 23
High point         | Speech signal segment 24
After determining that the second deduplicated text is "small A please turn on the air conditioner high point", the second deduplicated speech signal may be determined according to the speech signal segments corresponding to the second deduplicated text in the 1st and 2nd second speech signals respectively. For example, the speech signal segment 11 corresponding to "small A (first)", the speech signal segment 13 corresponding to "please turn on the air conditioner", and the speech signal segment 24 corresponding to "high point" are spliced into the second deduplicated speech signal. Of course, the second deduplicated speech signal may also be formed by splicing speech signal segment 12, speech signal segment 13, and speech signal segment 24.
After the second de-duplicated speech signal is obtained, it may be determined whether the second de-duplicated speech signal meets the update condition (i.e., the determination condition in step 803 is executed again), and if yes, step 807 is executed.
If it is determined that the second deduplicated speech signal does not meet the update condition, the 3rd second voice signal continues to be acquired. ASR processing is performed on the 3rd second voice signal to obtain the corresponding text, and a deduplication operation is then performed on that text according to the historical deduplicated text (the second deduplicated text) to obtain the deduplicated text jointly corresponding to the 1st through 3rd second voice signals (which may be called the third deduplicated text). A deduplicated speech signal (which may be referred to as the third deduplicated speech signal) is determined from the 1st through 3rd second speech signals based on the third deduplicated text, and it is then determined whether the third deduplicated speech signal meets the update condition.
By repeating the above operations, step 806 may be executed: obtain the i-th second voice signal, and determine the deduplicated speech signal jointly corresponding to the 1st through i-th second voice signals according to the i-th second voice signal and its historical deduplicated text, where the historical deduplicated text is determined from the texts respectively corresponding to the 1st through (i-1)-th second voice signals. This continues until the obtained deduplicated speech signal meets the update condition.
To better explain the example of the present application, it is explained in conjunction with table 4. It is assumed that the update condition is that the duration of the deduplicated speech signal is greater than 5s (i.e., the duration threshold is 5s) and that the number of words in the non-repeated text corresponding to the deduplicated speech signal is greater than 12 (i.e., the word number threshold is 12). In the process of updating the voiceprint:
(1) The 1st second voice signal "hello small A" is acquired, and ASR is executed to obtain the corresponding text "hello small A". The text contains no repeated text, so the first deduplicated text is "hello small A", with a word count of 4, and the duration of the first deduplicated speech signal is 1.5s. Since the word count of the first deduplicated text is not greater than 12 and the duration of the first deduplicated speech signal is not greater than 5s, the 2nd second speech signal is further acquired.
(2) The 2nd second voice signal "turn on the air conditioner" is acquired, and ASR is executed to obtain the corresponding text "turn on the air conditioner". Deduplication and splicing operations are performed on "turn on the air conditioner" according to the historical deduplicated text "hello small A", obtaining the second deduplicated text "hello small A turn on the air conditioner", with a word count of 8; the duration of the corresponding second deduplicated speech signal is 3s. Since the word count is not greater than 12 and the duration is not greater than 5s, the 3rd second speech signal is further acquired.
(3) The 3rd second voice signal "air conditioner on high point" is acquired, and ASR is executed to obtain the corresponding text "air conditioner on high point". Deduplication and splicing operations are performed according to the historical deduplicated text "hello small A turn on the air conditioner", obtaining the third deduplicated text "hello small A turn on the air conditioner high point", with a word count of 10; the duration of the corresponding third deduplicated speech signal is 3.5s. Since the word count is not greater than 12 and the duration is not greater than 5s, the 4th second speech signal is further acquired.
(4) The 4th second voice signal "hello small A" is acquired, and ASR is executed to obtain the corresponding text "hello small A". Since "hello small A" is entirely repeated in the historical deduplicated text "hello small A turn on the air conditioner high point", the fourth deduplicated text remains "hello small A turn on the air conditioner high point", with a word count of 10, and the duration of the corresponding fourth deduplicated speech signal is 3.5s. Since the word count is not greater than 12 and the duration is not greater than 5s, the 5th second speech signal is further acquired.
(5) The 5th second voice signal "open the skylight" is acquired, and ASR is executed to obtain the corresponding text "open the skylight". Deduplication and splicing operations are performed according to the historical deduplicated text "hello small A turn on the air conditioner high point", obtaining the fifth deduplicated text "hello small A turn on the air conditioner high point skylight", with a word count of 12; the duration of the corresponding fifth deduplicated speech signal is 4s. Since the word count is not greater than 12 and the duration is not greater than 5s, the 6th second speech signal is further acquired.
(6) The 6th second voice signal "play music" is acquired, and ASR is executed to obtain the corresponding text "play music". Deduplication and splicing operations are performed according to the historical deduplicated text "hello small A turn on the air conditioner high point skylight", obtaining the sixth deduplicated text "hello small A turn on the air conditioner high point skylight play music", with a word count of 16; the duration of the corresponding sixth deduplicated speech signal is 5.5s. Now the word count of the sixth deduplicated text is greater than 12, and the duration of the sixth deduplicated speech signal is greater than 5s.
Thus, it is determined that the sixth deduplicated speech signal meets the update condition, and step 807 is executed according to the sixth deduplicated speech signal.
TABLE 4
No. | Second speech signal          | Deduplicated text                                                        | Word count | Duration
1   | hello small A                 | hello small A                                                            | 4          | 1.5s
2   | turn on the air conditioner   | hello small A turn on the air conditioner                                | 8          | 3s
3   | air conditioner on high point | hello small A turn on the air conditioner high point                     | 10         | 3.5s
4   | hello small A                 | hello small A turn on the air conditioner high point                     | 10         | 3.5s
5   | open the skylight             | hello small A turn on the air conditioner high point skylight            | 12         | 4s
6   | play music                    | hello small A turn on the air conditioner high point skylight play music | 16         | 5.5s
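The accumulation loop of steps 801-806, driven by a Table 4-like example, can be sketched as below. The per-phrase word counts and durations in `PHRASES`, and the pre-segmentation of each utterance into phrases, are assumptions chosen to reproduce the running totals of the example:

```python
PHRASES = {  # phrase -> (word count, duration in seconds); illustrative values
    "hello small A": (4, 1.5), "turn on": (2, 0.75),
    "air conditioner": (2, 0.75), "high point": (2, 0.5),
    "skylight": (2, 0.5), "play music": (4, 1.5),
}

def accumulate(utterances, word_threshold=12, duration_threshold=5.0):
    """Collect second voice signals one by one while maintaining a
    deduplicated text; stop at the first signal after which the update
    condition holds. Returns (signal index, word count, duration), with
    index None if the condition is never met."""
    seen, dedup = set(), []
    words = secs = 0
    for i, utterance in enumerate(utterances, 1):
        for phrase in utterance:
            if phrase not in seen:
                seen.add(phrase)
                dedup.append(phrase)
        words = sum(PHRASES[p][0] for p in dedup)
        secs = sum(PHRASES[p][1] for p in dedup)
        if words > word_threshold and secs > duration_threshold:
            return i, words, secs
    return None, words, secs
```

Feeding in the six utterances of the example (each pre-split into phrases) meets the update condition at the 6th signal, with 16 words and 5.5 s of deduplicated speech.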
Step 807, a third voiceprint of the first user is determined from the deduplicated speech signal meeting the update condition.
Illustratively, according to the deduplicated speech signal meeting the update condition, the voiceprint extraction process exemplarily shown in fig. 1c is performed to obtain the third voiceprint of the first user.
Step 808, the reference voiceprint of the first user is updated according to the third voiceprint.
In an alternative implementation, the reference voiceprint of the first user may be updated automatically, that is, the third voiceprint replaces the original reference voiceprint. In another alternative implementation, the first user may be prompted as to whether to update the reference voiceprint.
The manner of prompting the first user whether to update the reference voiceprint can be explained with reference to the example in fig. 2: the first user is user A (the owner of the vehicle), and when the vehicle-mounted device determines that the voiceprint of user A has undergone a long-term change, it may ask user A in the display interface whether to automatically update the reference voiceprint.
For example, as in fig. 9, the vehicle-mounted device displays a prompt message: "It is detected that your voiceprint has changed. Replace the original voiceprint with the detected voiceprint?". If user A clicks "OK", the vehicle-mounted device may replace the original reference voiceprint with the acquired third voiceprint. If user A instead clicks "I want to update it myself", the vehicle-mounted device may further display a passage of preset text in the display interface, prompt user A to read the preset text aloud, acquire the current voice signal of user A, extract a voiceprint from that voice signal to obtain a new reference voiceprint of user A, and store it. Of course, in some other embodiments, when it is determined according to the verification result of the first voiceprint in the first time window that the reference voiceprint needs to be updated, user A may simply be prompted in the display interface to update the reference voiceprint himself.
After the reference voiceprint is updated, a sliding operation may be performed on the second time window to obtain a slid second time window. The starting point of the slid second time window is not earlier than the time point at which the reference voiceprint was updated; alternatively, the starting point of the slid second time window is not earlier than the acquisition time point of the second voice signal.
For example, the starting point of the second time window may be after the end point of the third time window or coincide with the end point of the third time window. As in fig. 5 (e), the start point of the second time window after sliding coincides with the end point of the third time window. Further, in a second time window after the reference voiceprint is updated (i.e., the second time window after sliding), one or more second voiceprints in the second time window after sliding may be obtained, and the average similarity of the second time window after sliding may be determined according to the similarities of the one or more second voiceprints with the updated reference voiceprint, respectively.
Further, after the average similarity of the slid second time window is determined, the first time window is slid, and the end point of the slid first time window is later than the starting point of the slid second time window. For example, the starting point of the slid first time window may be after, or coincide with, the end point of the slid second time window. As in fig. 5 (e), the starting point of the slid first time window coincides with the end point of the slid second time window; the slid first time window is specifically t8-t9, where t8 and t9 are separated by a first preset duration.
One or more first voiceprints in the slid first time window are then acquired, and their verification results are determined according to their similarities to the updated reference voiceprint and the average similarity of the slid second time window, so as to determine whether to update the reference voiceprint again.
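The sliding-window verification described above can be sketched as follows. Cosine similarity and all function names are assumptions for illustration; the patent does not mandate a particular similarity measure or voiceprint representation.

```python
# Hedged sketch of the window-based verification: compute the second
# window's average similarity to the reference voiceprint, then verify
# each first-window voiceprint against that average.
import math

def cosine(a, b):
    """Cosine similarity between two voiceprint embeddings (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def window_average_similarity(second_voiceprints, reference):
    """Average similarity of the second time window."""
    sims = [cosine(v, reference) for v in second_voiceprints]
    return sum(sims) / len(sims)

def verify_first_window(first_voiceprints, reference, avg_similarity):
    """A first voiceprint passes when its similarity exceeds the average."""
    return [cosine(v, reference) > avg_similarity for v in first_voiceprints]

reference = [1.0, 0.0]                       # toy 2-D "voiceprint"
second = [[0.9, 0.1], [0.8, 0.2]]            # second-window voiceprints
avg = window_average_similarity(second, reference)
first = [[0.95, 0.05], [0.1, 0.9]]           # first-window voiceprints
results = verify_first_window(first, reference, avg)
print(results)                               # [True, False]
```

The first candidate is more similar to the reference than the second window's average and passes; the second, markedly different, fails.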
The foregoing is further explained with reference to fig. 10, which exemplarily shows a flowchart of yet another voiceprint management method:
and acquiring a voice signal of the first user, and preprocessing the voice signal of the first user to obtain audio characteristic information of the voice signal. And performing audio characteristic extraction according to the audio characteristic information, and determining the voiceprint of the first user. And executing post-processing to obtain the final voiceprint of the first user. In one case, the voiceprint of the first user is registered or updated as a reference voiceprint. In another case, it may be determined that the current time point is located in the first time window or in the second time window.
When the current time point falls in the first time window, the voiceprint of the first user (that is, the first voiceprint) is compared with the reference voiceprint to obtain their similarity; if the similarity is greater than the average similarity of the second time window, the first voiceprint is determined to pass verification. The ratio of verified first voiceprints in the first time window (that is, the achievement rate in the first time window) is then determined: if the achievement rate is greater than the ratio threshold, the first time window is slid; if the achievement rate is less than or equal to the ratio threshold, the process of updating the reference voiceprint is started.
When the current time point falls in the second time window, the voiceprint of the first user (that is, the second voiceprint) is compared with the reference voiceprint to obtain their similarity, and the average of the similarities between the plurality of second voiceprints in the second time window and the reference voiceprint is determined as the average similarity.
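The achievement-rate decision in the first-time-window branch above can be sketched briefly. The 0.5 ratio threshold is an illustrative assumption; the patent leaves the threshold value open.

```python
# Sketch of the achievement-rate decision: start the reference-voiceprint
# update flow when the pass rate falls to or below the ratio threshold.
def should_update_reference(verification_results, ratio_threshold=0.5):
    """verification_results: list of booleans, one per first voiceprint."""
    pass_rate = sum(verification_results) / len(verification_results)
    return pass_rate <= ratio_threshold

print(should_update_reference([True, True, False, True]))   # pass rate 0.75 -> False
print(should_update_reference([False, True, False, False])) # pass rate 0.25 -> True
```

When `should_update_reference` returns `False`, the first time window is simply slid; when it returns `True`, the update process (acquiring second voice signals, deduplication, extracting the third voiceprint) begins.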
It should be added that the present application refers to a plurality of thresholds, such as the ratio threshold, the similarity threshold, the difference threshold, the count threshold, the duration threshold, and the word-count threshold. In a determination made against a threshold, the first result may be obtained when the compared value is greater than or equal to the threshold and the second result when the value is less than the threshold; alternatively, the first result may be obtained when the value is greater than the threshold and the second result when the value is less than or equal to the threshold. The present application does not limit this. For example, in determining whether to update the reference voiceprint of the first user, although the embodiments indicate that the reference voiceprint need not be updated when the ratio is greater than the ratio threshold (the first result) and is updated when the ratio is less than or equal to the ratio threshold (the second result), it is equally possible in the present application that the reference voiceprint need not be updated when the ratio is greater than or equal to the ratio threshold (the first result) and is updated when the ratio is less than the ratio threshold (the second result).
It should also be added that, in the process in which the voiceprint management device acquires the text corresponding to the second voice signal, the semantic understanding device may instead execute ASR to obtain that text, and the voiceprint management device may then acquire the text corresponding to the second voice signal from the semantic understanding device. The present application does not limit this.
It should be understood that, in the above technical solution, the reference voiceprint of the first user is determined in advance; the second voiceprints of the first user are then acquired in the second time window, the similarity between each second voiceprint and the reference voiceprint is determined, and the average similarity of the second time window is determined from the plurality of similarities in the second time window. Because the voiceprint of the user changes only gradually during the second time window after the reference voiceprint is stored, that is, in the period from the registration of the reference voiceprint to the end point of the second time window, a reference parameter (namely the average similarity of the second time window with respect to the reference voiceprint) can be obtained based on the steadily changing voiceprint of the first user. Subsequently, to determine whether the voiceprint of the user has changed in the long term, the similarity between the first voiceprint of the user in the first time window and the reference voiceprint is determined and compared with the average similarity of the second time window, yielding the verification result of the first voiceprint. This in effect determines whether the first voiceprint in the first time window resembles the second voiceprints in the second time window, and thus whether the voiceprint of the user has changed in the long term over the period from the starting point of the second time window to the end point of the first time window.
Moreover, the verification results of the plurality of first voiceprints in the first time window are determined, and the ratio of verified first voiceprints in the first time window (that is, the achievement rate in the first time window) is used to decide whether the voiceprint of the user has changed in the long term. This avoids misjudgment caused by incidental factors affecting the user within the first time window and improves judgment accuracy.
Furthermore, when the reference voiceprint is not updated, the first time window can be slid, and whether the voiceprint of the user has changed in the long term is determined according to the ratio of verified first voiceprints in the slid first time window (that is, the achievement rate in the slid first time window). In this way, a long-term change in the voiceprint of the first user can be discovered in time, and the registered voiceprint of the first user can be updated.
Furthermore, when the reference voiceprint is updated, the second voice signal is acquired and the voiceprint of the first user is determined from it, without requiring any interaction with the user; the update is imperceptible to the user, improving user experience. The plurality of first voiceprints in the first time window need not be stored in advance, which helps reduce storage. In the present application, the plurality of second voiceprints in the second time window likewise need not be stored; only the average similarity of the second time window needs to be stored, further reducing storage.
The various embodiments described herein may be implemented as stand-alone solutions or combined in accordance with their inherent logic, and all such solutions fall within the scope of the present application.
It is to be understood that, in the above embodiments of the method, the method and operations implemented by the voiceprint management apparatus may also be implemented by a component (e.g., a chip or a circuit) that can be used in the voiceprint management apparatus.
The division of modules in the embodiments of the present application is schematic and reflects only one logical functional division; other divisions are possible in actual implementation. In addition, functional modules in the embodiments of the present application may be integrated into one processor, may exist alone physically, or two or more modules may be integrated into one module. An integrated module may be realized in the form of hardware or in the form of a software functional module.
Based on the above and the same concept, fig. 11 and 12 are schematic structural diagrams of a possible voiceprint management apparatus exemplarily provided by the present application. The voiceprint management devices can be used for realizing the functions of the method embodiment, and therefore, the beneficial effects of the method embodiment can also be realized.
As shown in fig. 11, the present application provides a voiceprint management apparatus, including: an acquisition module 1101 and a processing module 1102.
Illustratively, the obtaining module 1101 may be configured to perform the obtaining function of the voiceprint management apparatus in the method embodiment related to fig. 6, fig. 8 or fig. 10, such as a step of obtaining the first voice signal, a step of obtaining the second voice signal, or the like. Illustratively, the processing module 1102 may be configured to perform the processing functions of the voiceprint management apparatus in the method embodiment related to fig. 6, fig. 8 or fig. 10, such as a step of determining a verification result of the first voiceprint, or a step of determining whether to update the reference voiceprint, or a step of updating the reference voiceprint, etc.
In a possible implementation manner, the obtaining module 1101 is configured to obtain a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window; the first voiceprint is determined according to a first voice signal of a first user obtained in the first time window; the average similarity of the second time window is used to indicate a similarity between the voiceprint of the first user obtained in the second time window and the reference voiceprint; and a starting point of the first time window is later than a starting point of the second time window. The processing module 1102 is configured to update the reference voiceprint according to the verification result of the first voiceprint in the first time window.
In one possible implementation, the determining the verification result of the first voiceprint according to the average similarity of the first voiceprint, the reference voiceprint and the second time window includes: and when the similarity of the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is verified.
In a possible implementation manner, the processing module 1102 is specifically configured to: and updating the reference voiceprint according to the ratio of the verified first voiceprints in the first time window.
In a possible implementation manner, the processing module 1102 is specifically configured to: update the reference voiceprint when the ratio of verified first voiceprints in the first time window is less than or equal to a ratio threshold.
In a possible implementation manner, the processing module 1102 is specifically configured to: the control obtaining module 1101 obtains a second voice signal of the first user, where a difference between a similarity of the second voice signal and a reference voiceprint and an average similarity of a first time window is smaller than a difference threshold, and the average similarity of the first time window is used to indicate a similarity between the voiceprint of the first user obtained in the first time window and the reference voiceprint; the reference voiceprint is updated based on the second speech signal.
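The second-signal selection rule just described can be sketched briefly: a candidate voice signal is kept only when the difference between its voiceprint's similarity to the reference and the first time window's average similarity is below the difference threshold. The function name and the 0.05 threshold are illustrative assumptions.

```python
# Sketch of the second-signal selection rule: accept a candidate whose
# similarity to the reference voiceprint is close to the first window's
# average similarity (difference below the difference threshold).
def is_candidate_second_signal(similarity, first_window_avg, diff_threshold=0.05):
    return abs(similarity - first_window_avg) < diff_threshold

print(is_candidate_second_signal(0.62, 0.60))  # True  (difference 0.02)
print(is_candidate_second_signal(0.80, 0.60))  # False (difference 0.20)
```

Intuitively, this selects signals that reflect the user's current (changed) voice as captured by the first window, rather than outliers.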
In a possible implementation manner, the processing module 1102 is specifically configured to: obtaining a duplication-removing voice signal according to the second voice signal and a text corresponding to the second voice signal, wherein the text corresponding to the duplication-removing voice signal is a non-repeated text; and updating the reference voiceprint according to the duplicate removal voice signal.
In a possible implementation manner, there is one second speech signal, and the processing module 1102 is specifically configured to: perform a deduplication operation on the text corresponding to the second speech signal to obtain a de-duplicated text; and obtain the de-duplicated speech signal according to the speech signal, within the second speech signal, that corresponds to the de-duplicated text.
In a possible implementation manner, there are a plurality of second speech signals, and the processing module 1102 is specifically configured to: for an ith second speech signal of the plurality of second speech signals: perform a deduplication operation on the text corresponding to the ith second speech signal according to a historical de-duplicated text; splice the text corresponding to the ith second speech signal after the deduplication operation with the historical de-duplicated text to obtain a de-duplicated text jointly corresponding to the 1st second speech signal through the ith second speech signal, where the historical de-duplicated text is obtained according to the texts respectively corresponding to the 1st second speech signal through the (i-1)th second speech signal, and i is greater than 1; and obtain the de-duplicated speech signal according to the speech signals, among the plurality of second speech signals, that correspond to the jointly obtained de-duplicated text.
In one possible implementation, the duration of the deduplicated speech signal is greater than a duration threshold, and/or the number of words in the non-repetitive text corresponding to the deduplicated speech signal is greater than a word number threshold.
In one possible implementation, the processing module 1102 is further configured to: and sliding the first time window, wherein the end point of the first time window after sliding is later than the end point of the first time window before sliding.
In one possible implementation, the length of the first time window is variable.
In one possible implementation, the processing module 1102 is further configured to: and sliding the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
In a possible implementation manner, before the obtaining module 1101 obtains the verification result of the first voiceprint in the first time window, the processing module 1102 is further configured to: determining the similarity in the second time window according to the reference voiceprint and the second voiceprint of the first user acquired in the second time window; and determining the average similarity of the second time window according to the plurality of similarities in the second time window.
Fig. 12 shows another apparatus provided in an embodiment of the present application; the apparatus shown in fig. 12 may be a hardware-circuit implementation of the apparatus shown in fig. 11, and may be adapted to perform the method embodiments in the flowcharts illustrated above.
For ease of illustration, fig. 12 shows only the main components of the device.
The voiceprint management apparatus includes: a processor 1210 and an interface 1230, optionally the voiceprint management apparatus further comprises a memory 1220. The interface 1230 is used to enable communication with other devices.
The method executed by the voiceprint management device in the above embodiments may be implemented by the processor 1210 calling a program stored in a memory (which may be the memory 1220 in the voiceprint management device, or an external memory). That is, the voiceprint management device may include the processor 1210, which executes the method performed by the voiceprint management device in the above method embodiments by calling a program in a memory. The processor here may be an integrated circuit with signal-processing capability, such as a CPU. The voiceprint management device may also be implemented by one or more integrated circuits configured to implement the above method, for example one or more ASICs, one or more digital signal processors (DSPs), or one or more FPGAs, or a combination of at least two of these integrated circuit forms. Alternatively, the above implementations may be combined.
Specifically, the functions/implementation processes of the processing module 1102 and the obtaining module 1101 in fig. 11 can be implemented by the processor 1210 in the voiceprint management apparatus shown in fig. 12 calling the computer execution instructions stored in the memory 1220.
Based on the above and the same idea, the present application provides a computer program product comprising a computer program or instructions for implementing the method in the above method embodiments when the computer program or instructions are executed by a computing device.
Based on the above and the same idea, the present application provides a computer-readable storage medium, in which a computer program or instructions are stored, which, when executed by a computing device, implement the method in the above-described method embodiments.
Based on the above and the same idea, the present application provides a computing device comprising a processor connected to a memory for storing a computer program, the processor being configured to execute the computer program stored in the memory, so as to cause the computing device to perform the method in the above method embodiments.
Based on the above and the same conception, an embodiment of the present application provides a chip system, including: a processor coupled to the memory, and a memory for storing a program or instructions which, when executed by the processor, cause the system-on-chip to implement the method of the above-described method embodiments.
Optionally, the system-on-chip further comprises an interface circuit for interfacing the code instructions to the processor.
Optionally, the number of processors in the chip system may be one or more, and the processors may be implemented by hardware or software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory.
Optionally, there may be one or more memories in the chip system. The memory may be integrated with the processor or arranged separately from the processor. Illustratively, the memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated on the same chip as the processor or provided separately on a different chip.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (29)

  1. A voiceprint management method, comprising:
    obtaining a verification result of a first voiceprint in a first time window, wherein the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window, the first voiceprint is determined according to a first voice signal of a first user obtained in the first time window, the average similarity of the second time window is used for indicating a similarity condition of the voiceprint of the first user obtained in the second time window and the reference voiceprint, and a starting point of the first time window is later than a starting point of the second time window;
    and updating the reference voiceprint according to the verification result of the first voiceprint in the first time window.
  2. The method of claim 1, wherein the determining of the verification result of the first voiceprint according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window comprises:
    and when the similarity of the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is verification passing.
  3. A method according to claim 1 or 2, wherein said updating said reference voiceprint in dependence on a result of said verification of said first voiceprint in said first time window comprises:
    and updating the reference voiceprint according to the ratio of the verified first voiceprints in the first time window.
  4. The method of claim 3, wherein said updating the reference voiceprint based on a ratio of validated first voiceprints in the first time window comprises:
    updating the reference voiceprint when a ratio of validated first voiceprints in the first time window is less than or equal to a ratio threshold.
  5. The method of any of claims 1 to 4, wherein said updating said reference voiceprint comprises:
    acquiring a second voice signal of the first user, wherein a difference between a similarity of the second voice signal and the reference voiceprint and an average similarity of the first time window is smaller than a difference threshold, and the average similarity of the first time window is used for indicating a similarity between the voiceprint of the first user acquired in the first time window and the reference voiceprint;
    and updating the reference voiceprint according to the second voice signal.
  6. The method of claim 5, wherein said updating the reference voiceprint based on the second speech signal comprises:
    according to the second voice signal and the text corresponding to the second voice signal, obtaining a duplicate removal voice signal, wherein the text corresponding to the duplicate removal voice signal is a non-duplicate text;
    and updating the reference voiceprint according to the de-duplication speech signal.
  7. The method of claim 6, wherein the second speech signal is one, and wherein obtaining the de-duplicated speech signal based on the second speech signal and the corresponding text of the second speech signal comprises:
    executing a duplication removing operation on the text corresponding to the second voice signal to obtain a duplication removing text;
    and obtaining the duplication-removing voice signal according to the voice signal corresponding to the duplication-removing text in the second voice signal.
  8. The method of claim 6, wherein the second speech signal is plural, and the obtaining the de-duplicated speech signal according to the second speech signal and the corresponding text of the second speech signal comprises:
    for an ith second speech signal of the plurality of second speech signals:
    according to a historical duplication removing text, executing a duplication removing operation on a text corresponding to the ith second voice signal; splicing the text corresponding to the ith second voice signal after the duplication removing operation with the historical duplication removing text to obtain a duplication removing text jointly corresponding to the 1st second voice signal through the ith second voice signal; the historical de-duplicated text is obtained according to texts respectively corresponding to the 1st second voice signal through the (i-1)th second voice signal, wherein i is larger than 1;
    and obtaining the de-duplicated speech signal according to the speech signals, among the plurality of second speech signals, that correspond to the jointly obtained de-duplicated text.
  9. The method of any of claims 6 to 8, wherein the duration of the de-duplicated speech signal is greater than a duration threshold, and/or the number of words in the corresponding non-repeated text of the de-duplicated speech signal is greater than a word count threshold.
  10. The method of any of claims 1 to 9, further comprising:
    and sliding the first time window, wherein the termination point of the first time window after sliding is later than the termination point of the first time window before sliding.
  11. The method of any one of claims 1 to 10, wherein the length of the first time window is variable.
  12. The method of any of claims 1 to 11, further comprising:
    and sliding the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
  13. The method of any of claims 1 to 12, wherein prior to obtaining the verification of the first voiceprint in the first time window, further comprising:
    determining similarity in the second time window according to the reference voiceprint and the second voiceprint of the first user acquired in the second time window;
    and determining the average similarity of the second time window according to the plurality of similarities in the second time window.
  14. A voiceprint management apparatus, comprising:
    the device comprises an acquisition module and a processing module;
    the obtaining module is configured to obtain a verification result of a first voiceprint in a first time window, where the verification result of the first voiceprint is determined according to the first voiceprint, a reference voiceprint, and an average similarity of a second time window, the first voiceprint is determined according to a first voice signal of a first user obtained in the first time window, the average similarity of the second time window is used to indicate a similarity between the voiceprint of the first user obtained in the second time window and the reference voiceprint, and a starting point of the first time window is later than a starting point of the second time window;
    the processing module is configured to update the reference voiceprint according to a verification result of the first voiceprint in the first time window.
  15. The apparatus of claim 14, wherein the verification result of the first voiceprint being determined according to the first voiceprint, the reference voiceprint, and the average similarity of the second time window comprises:
    when the similarity between the first voiceprint and the reference voiceprint is greater than the average similarity of the second time window, the verification result of the first voiceprint is that verification passes.
  16. The apparatus according to claim 14 or 15, wherein the processing module is specifically configured to:
    updating the reference voiceprint according to a ratio of verified first voiceprints in the first time window.
  17. The apparatus of claim 16, wherein the processing module is specifically configured to:
    updating the reference voiceprint when a ratio of validated first voiceprints in the first time window is less than or equal to a ratio threshold.
  18. The apparatus according to any one of claims 14 to 17, wherein the processing module is specifically configured to:
    controlling the obtaining module to obtain a second voice signal of the first user, wherein a difference between the similarity of the second voice signal to the reference voiceprint and the average similarity of the first time window is smaller than a difference threshold, and the average similarity of the first time window is used to indicate a similarity between the voiceprint of the first user obtained in the first time window and the reference voiceprint; and
    updating the reference voiceprint according to the second voice signal.
  19. The apparatus of claim 18, wherein the processing module is specifically configured to:
    obtaining a de-duplicated voice signal according to the second voice signal and a text corresponding to the second voice signal, wherein the text corresponding to the de-duplicated voice signal is a non-repetitive text; and
    updating the reference voiceprint according to the de-duplicated voice signal.
  20. The apparatus of claim 18, wherein there is one second voice signal, and the processing module is specifically configured to:
    performing a de-duplication operation on the text corresponding to the second voice signal to obtain a de-duplicated text; and
    obtaining the de-duplicated voice signal according to the portion of the second voice signal corresponding to the de-duplicated text.
  21. The apparatus according to claim 18, wherein there are a plurality of second voice signals, and the processing module is specifically configured to:
    for an ith second voice signal of the plurality of second voice signals, where i is greater than 1: performing a de-duplication operation on the text corresponding to the ith second voice signal according to a historical de-duplicated text, and splicing the text corresponding to the ith second voice signal after the de-duplication operation with the historical de-duplicated text, to obtain a de-duplicated text jointly corresponding to the 1st through ith second voice signals, wherein the historical de-duplicated text is obtained according to the texts respectively corresponding to the 1st through (i-1)th second voice signals; and
    obtaining the de-duplicated voice signal according to the portions of the plurality of second voice signals that correspond to the jointly corresponding de-duplicated text.
  22. The apparatus according to any of claims 18 to 21, wherein a duration of the de-duplicated voice signal is greater than a duration threshold, and/or a number of words in the non-repetitive text corresponding to the de-duplicated voice signal is greater than a word-count threshold.
  23. The apparatus of any of claims 14 to 22, wherein the processing module is further to:
    sliding the first time window, wherein the end point of the first time window after sliding is later than the end point of the first time window before sliding.
  24. The apparatus of any of claims 14 to 23, wherein the length of the first time window is variable.
  25. The apparatus of any of claims 14 to 24, wherein the processing module is further to:
    sliding the second time window, wherein the end point of the second time window after sliding is later than the end point of the second time window before sliding, and the starting point of the second time window after sliding is earlier than the starting point of the first time window after sliding.
  26. The apparatus according to any one of claims 14 to 25, wherein before the obtaining module obtains the verification result of the first voiceprint in the first time window, the processing module is further configured to:
    determining similarities in the second time window according to the reference voiceprint and second voiceprints of the first user acquired in the second time window; and
    determining the average similarity of the second time window according to the plurality of similarities in the second time window.
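The incremental de-duplication recited in claims 20 and 21 (de-duplicate each new transcript against the historical de-duplicated text, then splice the remainder on) might look roughly like the sketch below. Word-level granularity and all names are assumptions for illustration; the patent does not specify the unit of de-duplication:

```python
def dedup_incremental(transcripts):
    """Build the de-duplicated text jointly corresponding to all second
    voice signals: each i-th transcript is de-duplicated against the
    historical de-duplicated text, and the remainder is spliced on."""
    history = []   # historical de-duplicated text, kept as a word list
    seen = set()
    for transcript in transcripts:      # text of the i-th second voice signal
        for word in transcript.split():
            if word not in seen:        # de-duplication against history
                seen.add(word)
                history.append(word)    # splicing step
    return " ".join(history)
```

Here `dedup_incremental(["open the door", "the door please"])` yields `"open the door please"`: the repeated "the door" is dropped from the second transcript before splicing, so the resulting text is non-repetitive.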
  27. A computer program product comprising a computer program or instructions which, when executed by a computing device, implement the method of any one of claims 1 to 13.
  28. A computer-readable storage medium, having stored thereon a computer program or instructions which, when executed by a computing device, carry out the method of any one of claims 1 to 13.
  29. A computing device comprising a processor coupled to a memory, the memory configured to store a computer program, the processor configured to execute the computer program stored in the memory to cause the computing device to perform the method of any of claims 1 to 13.
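As a rough illustration of the update trigger in claims 16-17 (update when the ratio of verified first voiceprints falls to or below a threshold) and the gate on the de-duplicated signal in claim 22, consider the sketch below. The threshold values and function names are invented for illustration; the patent leaves them unspecified:

```python
def should_update(verification_results, ratio_threshold=0.5):
    """Claims 16-17: update the reference voiceprint when the ratio of
    verified (passed) first voiceprints in the first time window is
    less than or equal to the ratio threshold."""
    ratio = sum(verification_results) / len(verification_results)
    return ratio <= ratio_threshold

def dedup_signal_usable(duration_s, word_count,
                        duration_threshold=5.0, word_count_threshold=10):
    """Claim 22: use the de-duplicated voice signal only when its duration
    and/or its non-repetitive word count exceed the thresholds."""
    return duration_s > duration_threshold or word_count > word_count_threshold
```

With the illustrative 0.5 ratio threshold, a first time window in which only 1 of 3 first voiceprints passed verification triggers an update of the reference voiceprint, while a window with 2 of 3 passing does not.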
CN202180041437.8A 2021-05-14 2021-05-14 Voiceprint management method and device Pending CN115699168A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/093917 WO2022236827A1 (en) 2021-05-14 2021-05-14 Voiceprint management method and apparatus

Publications (1)

Publication Number Publication Date
CN115699168A true CN115699168A (en) 2023-02-03

Family

ID=84027940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180041437.8A Pending CN115699168A (en) 2021-05-14 2021-05-14 Voiceprint management method and device

Country Status (2)

Country Link
CN (1) CN115699168A (en)
WO (1) WO2022236827A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597839B (en) * 2023-07-17 2023-09-19 山东唐和智能科技有限公司 Intelligent voice interaction system and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070219801A1 (en) * 2006-03-14 2007-09-20 Prabha Sundaram System, method and computer program product for updating a biometric model based on changes in a biometric feature of a user
CN108231082A (en) * 2017-12-29 2018-06-29 广州势必可赢网络科技有限公司 A kind of update method and device of self study Application on Voiceprint Recognition
CN110896352B (en) * 2018-09-12 2022-07-08 阿里巴巴集团控股有限公司 Identity recognition method, device and system
CN110400567B (en) * 2019-07-30 2021-10-19 深圳秋田微电子股份有限公司 Dynamic update method for registered voiceprint and computer storage medium
CN111063360B (en) * 2020-01-21 2022-08-19 北京爱数智慧科技有限公司 Voiceprint library generation method and device

Also Published As

Publication number Publication date
WO2022236827A1 (en) 2022-11-17

Similar Documents

Publication Publication Date Title
Lavrentyeva et al. STC antispoofing systems for the ASVspoof2019 challenge
KR102339594B1 (en) Object recognition method, computer device, and computer-readable storage medium
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
WO2017197953A1 (en) Voiceprint-based identity recognition method and device
CN110265037B (en) Identity verification method and device, electronic equipment and computer readable storage medium
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
WO2019019743A1 (en) Information auditing method and apparatus, electronic device and computer readable storage medium
WO2020155584A1 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
CN109215665A (en) A kind of method for recognizing sound-groove based on 3D convolutional neural networks
CN109841214B (en) Voice wakeup processing method and device and storage medium
CN110880329A (en) Audio identification method and equipment and storage medium
CN110400567B (en) Dynamic update method for registered voiceprint and computer storage medium
WO2021042537A1 (en) Voice recognition authentication method and system
KR101888058B1 (en) The method and apparatus for identifying speaker based on spoken word
CN110428835B (en) Voice equipment adjusting method and device, storage medium and voice equipment
CN113823293B (en) Speaker recognition method and system based on voice enhancement
CN112328994A (en) Voiceprint data processing method and device, electronic equipment and storage medium
CN111243603A (en) Voiceprint recognition method, system, mobile terminal and storage medium
JP7160095B2 (en) ATTRIBUTE IDENTIFIER, ATTRIBUTE IDENTIFICATION METHOD, AND PROGRAM
US20240013784A1 (en) Speaker recognition adaptation
CN110634492A (en) Login verification method and device, electronic equipment and computer readable storage medium
CN111402880A (en) Data processing method and device and electronic equipment
CN113643707A (en) Identity verification method and device and electronic equipment
CN115699168A (en) Voiceprint management method and device
CN109065026B (en) Recording control method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination