CN115954006A

CN115954006A - Registration frequency self-adaptive voiceprint recognition method and device, electronic equipment and storage medium

Info

Publication number: CN115954006A
Application number: CN202211627928.3A
Authority: CN
Inventors: 于伟维; 陈锦明; 李倩; 巩宁
Original assignee: Bestechnic Shanghai Co Ltd
Current assignee: Bestechnic Shanghai Co Ltd
Priority date: 2022-12-16
Filing date: 2022-12-16
Publication date: 2023-04-11

Abstract

The application provides a registration frequency self-adaptive voiceprint recognition method and device, electronic equipment and a computer readable storage medium, wherein the method comprises the following steps: acquiring a Gaussian mixture model serving as a general background model; acquiring registered audio data of a target user, and constructing a training sample based on the registered audio data; updating the model parameters of the Gaussian mixture model by the training sample according to a maximum posterior probability estimation algorithm to obtain an updated Gaussian mixture model; judging whether the updated model parameters of the Gaussian mixture model are significantly different from the initially acquired model parameters of the general background model or not during the current registration based on a significant difference algorithm; and determining whether to execute the next registration process according to the judgment result. According to the scheme, the problem caused by too many or too few registration times can be avoided.

Description

Registration frequency self-adaptive voiceprint recognition method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of speaker identification technologies, and in particular, to a method and an apparatus for voiceprint identification with adaptive registration times, an electronic device, and a computer-readable storage medium.

Background

Speaker Recognition (SR), also known as Voiceprint Recognition (VPR), is a biometric technique that determines a speaker's identity from the speaker-specific characteristics carried in a speech signal. To provide a voiceprint recognition service for a target user, the voice data of the target user must be registered on the general background model, yielding a Gaussian mixture model that corresponds to the target user and represents that user's voiceprint characteristics. In the related art, registration is performed a preset number of times. However, no single registration count is valid for all registration flows. If there are too few registrations, the Gaussian mixture model obtained through registration cannot accurately represent the voiceprint characteristics of the target user, and subsequent voiceprint recognition may perform poorly. If there are too many registrations, the redundant registrations waste power and memory resources.

Disclosure of Invention

An object of the embodiments of the present application is to provide a voiceprint recognition method and apparatus with adaptive registration times, an electronic device, and a computer-readable storage medium, which are used to adaptively adjust the registration times according to the actual situation of a registration process, so as to complete registration with the most appropriate registration times, and avoid various problems caused by too few or too many registration times.

In one aspect, the present application provides a voiceprint recognition method with adaptive registration times, including:

acquiring a Gaussian mixture model serving as a general background model;

acquiring registered audio data of a target user, and constructing a training sample based on the registered audio data;

updating the model parameters of the Gaussian mixture model by the training sample according to a maximum posterior probability estimation algorithm to obtain an updated Gaussian mixture model;

judging whether the updated model parameters of the Gaussian mixture model are significantly different from the initially acquired model parameters of the general background model or not during the current registration based on a significant difference algorithm;

and determining whether to execute the next round of registration process according to the judgment result.

In an embodiment, before the obtaining the gaussian mixture model as the general background model, the method further comprises:

acquiring sample audio data of a plurality of non-target users, and constructing a plurality of training samples;

and training the initial Gaussian mixture model with the plurality of training samples according to an expectation-maximization (EM) algorithm to obtain the Gaussian mixture model serving as a general background model.

In an embodiment, the determining whether to execute the next round of registration procedure according to the determination result includes:

and if the judgment result indicates that no significant difference exists, determining to execute a next registration process, returning to the step of acquiring the registration audio data of the target user, and constructing a training sample based on the registration audio data.

In one embodiment, the obtaining of the registered audio data of the target user and the constructing of the training sample based on the registered audio data include:

acquiring registered audio data of a current registration flow, and splicing the registered audio data of the current registration flow with all registered audio data of a historical registration flow to obtain spliced audio data; the historical registration process is a previous registration process of the current registration process;

and extracting audio features from the spliced audio data to be used as a training sample of the current registration process.

In an embodiment, the determining whether to execute the next round of registration process according to the judgment result includes:

if the judgment result indicates that a significant difference exists, determining that the next round of registration process does not need to be executed, and taking the updated Gaussian mixture model in the current registration process as the target Gaussian mixture model corresponding to the target user.

In an embodiment, the method further comprises:

acquiring test audio data of a user to be identified, and extracting test audio features from the test audio data;

calculating a first probability value corresponding to the test audio feature according to the target Gaussian mixture model;

calculating a second probability value corresponding to the test audio characteristic according to the initially acquired general background model;

and judging whether the difference value between the first probability value and the second probability value is greater than a preset difference value threshold value, if so, determining that the user to be identified is the target user.

In an embodiment, the method further comprises:

if the difference value between the first probability value and the second probability value is not larger than the difference threshold value, determining that the user to be identified is not the target user.

On the other hand, the application provides a voiceprint recognition device with adaptive registration times, which comprises:

the first acquisition module is used for acquiring a Gaussian mixture model serving as a general background model;

the second acquisition module is used for acquiring registered audio data of a target user and constructing a training sample based on the registered audio data;

the updating module is used for updating the model parameters of the Gaussian mixture model by the training samples according to a maximum posterior probability estimation algorithm to obtain an updated Gaussian mixture model;

the judging module is used for judging whether the updated model parameters of the Gaussian mixture model are significantly different from the initially acquired model parameters of the general background model or not during the current registration based on a significant difference algorithm;

and the determining module is used for determining whether to execute the next registration process according to the judgment result.

Furthermore, the present application provides an electronic device comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the registration number adaptive voiceprint recognition method.

Further, the present application provides a computer-readable storage medium storing a computer program executable by a processor to perform the above-mentioned voiceprint recognition method adaptive to the number of registrations.

According to the scheme, in the process of registering on the general background model by means of the audio data of the target user, the updated model parameters and the initially acquired model parameters of the general background model can be evaluated through the significance difference algorithm, so that whether the Gaussian mixture model capable of accurately representing the voiceprint characteristics of the target user is obtained or not is fed back in real time according to the evaluation result. Therefore, the next round of registration process can be stopped after the Gaussian mixture model corresponding to the target user is obtained, so that the problem caused by too few or too many registration times is avoided.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.

Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a voiceprint recognition method with adaptive registration times according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a training process of voiceprint features according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of a voiceprint recognition method according to an embodiment of the present application;

Fig. 5 is a block diagram of a voiceprint recognition apparatus with adaptive registration times according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

As shown in fig. 1, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12, one processor 11 being exemplified in fig. 1. The processor 11 and the memory 12 are connected by a bus 10, and the memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11, so that the electronic device 1 can execute all or part of the flow of the method in the embodiments described below. In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a host, a server, or the like, and is configured to perform a registration number adaptive voiceprint recognition method. In an embodiment, the electronic device may be equipped with a low-power speech recognition chip, so that the voiceprint recognition method in the present scheme is executed by means of the low-power speech recognition chip.

The memory 12 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

The present application also provides a computer-readable storage medium storing a computer program executable by the processor 11 to perform the registration number adaptive voiceprint recognition method provided by the present application.

Fig. 2 is a schematic flowchart of a voiceprint recognition method with adaptive registration times according to an embodiment of the present application. As shown in fig. 2, the method may include the following steps 210 to 250.

Step 210: acquiring a Gaussian mixture model serving as a general background model.

The Universal Background Model (UBM) is a Gaussian Mixture Model (GMM) used to represent the general characteristics of human voice.

The electronic device executing the scheme of the application can read out the Gaussian mixture model serving as the general background model from the specified storage position.

In one embodiment, the generic background model may be trained prior to obtaining the generic background model.

The electronic device may obtain sample audio data for a plurality of non-target users, constructing a plurality of training samples. Here, the non-target user is any speaker without limitation, and the sample audio data is audio data of the non-target user collected for training the general background model. The electronic device may extract audio features from the audio data of each non-target user, respectively, as training samples, thereby obtaining a plurality of training samples. Each training sample includes voiceprint characteristics of a corresponding speaker.

After obtaining the plurality of training samples, the electronic device may train the initial gaussian mixture model with the plurality of training samples according to an Expectation-Maximization (EM) algorithm, so as to adjust model parameters of the gaussian mixture model, thereby obtaining a trained general background model. The initial Gaussian mixture model can be a Gaussian mixture model with random model parameters; the model parameters may include several types of parameters such as weight, mean, variance, etc.
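The EM training described above can be sketched as follows. This is a minimal illustration, not the implementation of the present application: it assumes diagonal covariances, NumPy feature matrices of shape (frames, dimensions), and a hypothetical helper name `train_ubm`; the component count, iteration count and variance floor are arbitrary illustrative values.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of every frame under every diagonal Gaussian, shape (T, K)."""
    diff = x[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_norm = np.sum(np.log(2 * np.pi * variances), axis=1)      # (K,)
    return -0.5 * (log_norm[None, :]
                   + np.sum(diff ** 2 / variances[None, :, :], axis=2))

def train_ubm(features, n_components=4, n_iter=20, seed=0, var_floor=1e-3):
    """Fit a diagonal-covariance GMM with EM (hypothetical helper, assumed setup)."""
    rng = np.random.default_rng(seed)
    n_frames, _ = features.shape
    # Random initial parameters: means picked from the data, global variance.
    means = features[rng.choice(n_frames, n_components, replace=False)].copy()
    variances = np.tile(features.var(axis=0) + var_floor, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each frame.
        log_p = np.log(weights)[None, :] + log_gauss_diag(features, means, variances)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from soft counts.
        n_k = resp.sum(axis=0) + 1e-10
        weights = n_k / n_k.sum()
        means = (resp.T @ features) / n_k[:, None]
        second_moment = (resp.T @ features ** 2) / n_k[:, None]
        variances = np.maximum(second_moment - means ** 2, var_floor)
    return weights, means, variances
```

The returned triple (weights, means, variances) corresponds to the three types of model parameters mentioned above.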

Step 220: acquiring registered audio data of the target user, and constructing a training sample based on the registered audio data.

With the generic background model obtained, the electronic device may perform a voiceprint registration procedure for any target user. Wherein, the target user is a speaker who needs to provide voiceprint recognition service subsequently.

In each registration process, a piece of registration audio data may be acquired. The enrollment audio data is audio data provided by the target user during the enrollment process. The electronic device can acquire the registered audio data of the target user through an audio acquisition device (such as a microphone), or acquire the acquired registered audio data of the target user from other devices carrying the audio acquisition device. After obtaining the enrollment audio data, the electronic device may construct a training sample with the enrollment audio data.

Step 230: updating the model parameters of the Gaussian mixture model with the training sample according to the maximum a posteriori probability estimation algorithm to obtain the updated Gaussian mixture model.

After obtaining the training sample, the electronic device may perform Maximum A Posteriori (MAP) adaptation on the Gaussian mixture model with the training sample, thereby updating the model parameters. The first update is applied to the Gaussian mixture model serving as the general background model; in other words, the model parameters of the general background model are updated. Any later update is applied iteratively to the Gaussian mixture model produced by the previous update.

Through the updating process, the updated Gaussian mixture model can be obtained. Since the model parameters are updated by means of the training samples corresponding to the target user, the updated gaussian mixture model contains model parameters indicating the voiceprint characteristics of the target user.
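One common way to realise MAP adaptation of a Gaussian mixture model is relevance-factor adaptation of the component means. The sketch below assumes that variant, which the present application does not specify: only the means are adapted, covariances are diagonal, and the function name `map_adapt_means` and the default `relevance=16.0` are illustrative assumptions.

```python
import numpy as np

def map_adapt_means(features, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation with a relevance factor (assumed variant).

    features: (T, D) training sample; weights: (K,); means, variances: (K, D).
    Returns the adapted means, shape (K, D).
    """
    # Posterior responsibility of each component for each frame.
    diff = features[:, None, :] - means[None, :, :]
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)

    # Sufficient statistics: soft counts and data-driven component means.
    n_k = resp.sum(axis=0)
    data_means = (resp.T @ features) / np.maximum(n_k, 1e-10)[:, None]

    # Components that saw more data move further from the prior (current) means.
    alpha = n_k / (n_k + relevance)
    return alpha[:, None] * data_means + (1.0 - alpha[:, None]) * means
```

Passing the general background model gives the first update; passing the previously adapted means gives the iterative update of a later registration round.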

Step 240: judging, based on a significant difference algorithm, whether the model parameters of the Gaussian mixture model updated in the current registration are significantly different from the initially acquired model parameters of the general background model.

Step 250: determining whether to execute the next round of registration process according to the judgment result.

In a round of the registration process, after updating the model parameters, the electronic device may compare the updated model parameters of the Gaussian mixture model with the initially acquired model parameters of the general background model based on a significant difference algorithm, and judge whether a significant difference exists between the two.

Further, it may be determined whether the registration process of the target user is completed according to the determination result, and the registration process is ended when the registration is completed, and the next round of registration process is executed when the registration is not completed.
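The present application does not name a particular significant difference algorithm. Purely as a stand-in for illustration, the sketch below scores the difference as a weighted, variance-normalised squared shift of the component means and compares it with a threshold. The statistic, the function names and the default `threshold=1.0` are all assumptions, not the algorithm of the application.

```python
import numpy as np

def mean_shift_score(ubm_weights, ubm_means, ubm_variances, adapted_means):
    """Weighted, variance-normalised squared mean shift (stand-in statistic)."""
    shift = (adapted_means - ubm_means) ** 2 / ubm_variances      # (K, D)
    return float(np.sum(ubm_weights * shift.sum(axis=1)))

def is_significantly_different(ubm_weights, ubm_means, ubm_variances,
                               adapted_means, threshold=1.0):
    """Step 240 stand-in: True when the adapted model has moved far enough."""
    score = mean_shift_score(ubm_weights, ubm_means, ubm_variances, adapted_means)
    return score > threshold
```

A True result corresponds to ending registration in step 250; a False result corresponds to executing another round.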

By the measures, in the process of registering on the general background model by means of the audio data of the target user, the updated model parameters and the initially acquired model parameters of the general background model can be evaluated through the significance difference algorithm, so that whether the Gaussian mixture model capable of accurately representing the voiceprint characteristics of the target user is obtained or not is fed back in real time according to the evaluation result. Therefore, the next round of registration process can be stopped after the Gaussian mixture model corresponding to the target user is obtained, so that the problem caused by too few or too many registration times is avoided.

In an embodiment, if the determination result indicates that there is no significant difference between the model parameters of the updated model and the initially acquired model parameters of the general background model, it may be determined that the updated gaussian mixture model obtained after the current registration is not the gaussian mixture model capable of accurately representing the voiceprint features of the target user. In this case, the electronic device may determine to perform the next round of registration process, and return to step 220 to obtain new registration audio data of the target user, construct a training sample based on the newly obtained registration audio data, and continue to perform the registration processes of steps 230 to 250 after constructing the training sample.

By the measures, the next registration process can be performed under the condition that the difference between the updated model parameters and the initially acquired model parameters of the general background model is not obvious enough.

In an embodiment, in the process of executing step 220, the electronic device may obtain the registered audio data of the current registration process, and splice the registered audio data of the current registration process with all the registered audio data of the historical registration process to obtain spliced audio data. The historical registration process is a previous registration process of the current registration process.

When registration is performed for the first time, no historical registration flow exists, so the registration audio data acquired the first time is used directly as the spliced audio data. In each subsequent registration flow, the registration audio data of the current flow can be spliced with the spliced audio data of the previous flow, so that all registration audio data of the historical flows are included. Illustratively, the registration audio data of the first registration flow is denoted N₁, that of the second N₂, that of the third N₃, and so on, with that of the m-th registration flow denoted Nₘ. Correspondingly, the spliced audio data of the first registration flow is N₁, that of the second comprises N₁ and N₂, that of the third comprises N₁, N₂ and N₃, and so on; the spliced audio data of the m-th registration flow therefore comprises N₁, N₂, ..., Nₘ₋₁, Nₘ.

The electronic device may extract audio features from the spliced audio data as the training sample of the current registration process. Therefore, in each registration process, the constructed training sample may include the audio features of all registration audio data acquired so far for the target user.
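The splicing scheme above can be sketched as a buffer that grows by one recording per registration round. The class name `EnrollmentBuffer` is hypothetical, and `extract_features` is only a placeholder that computes per-frame log energy; the present application does not state which audio features are used, and a real system would substitute its own feature extractor.

```python
import numpy as np

class EnrollmentBuffer:
    """Keeps the spliced registration audio N_1, N_2, ..., N_m (hypothetical helper)."""

    def __init__(self):
        self._spliced = np.zeros(0, dtype=np.float32)

    def add_round(self, audio):
        """Append the current round's audio and return the spliced signal."""
        chunk = np.asarray(audio, dtype=np.float32)
        self._spliced = np.concatenate([self._spliced, chunk])
        return self._spliced

def extract_features(audio, frame_len=400, hop=160):
    """Placeholder feature extractor: one log-energy value per frame."""
    n = len(audio)
    if n < frame_len:
        return np.zeros((0, 1))
    starts = range(0, n - frame_len + 1, hop)
    return np.array([[np.log(np.sum(audio[s:s + frame_len] ** 2) + 1e-10)]
                     for s in starts])
```

In the first round the spliced signal equals the first recording; in every later round the training sample is extracted from all recordings acquired so far.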

In an embodiment, if the determination result indicates that the model parameters of the updated model are significantly different from the initially obtained model parameters of the general background model, the electronic device may determine that a next registration process does not need to be executed, and may use the updated gaussian mixture model in the current registration process as the target gaussian mixture model corresponding to the target user. At this time, the updated gaussian mixture model in the current registration process has a significant difference from the initially acquired general background model, and can accurately represent the voiceprint characteristics of the target user.
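Taken together, the two branches amount to a loop over steps 220 to 250 that stops as soon as a significant difference is found. The sketch below shows only this control flow. The four callables are hypothetical placeholders supplied by the caller, and the `max_rounds` safety cap is an assumption that does not appear in the present application.

```python
def enroll_adaptively(get_audio, build_sample, adapt, is_significant, ubm,
                      max_rounds=10):
    """Run registration rounds until the adapted model differs significantly.

    get_audio(round_no)       -> registration audio of that round (step 220)
    build_sample(history)     -> training sample from all audio so far (step 220)
    adapt(model, sample)      -> updated model (step 230)
    is_significant(model, ubm)-> judgment result (step 240)
    Returns (model, rounds_used, converged).
    """
    history = []
    model = ubm                     # the first update starts from the UBM
    for round_no in range(1, max_rounds + 1):
        history.append(get_audio(round_no))
        sample = build_sample(history)
        model = adapt(model, sample)
        if is_significant(model, ubm):
            # Step 250: no further round is needed; this is the target model.
            return model, round_no, True
    # Assumed safety cap reached without a significant difference.
    return model, max_rounds, False
```

For example, with a toy scalar "model" that grows by half the sample length each round and a significance threshold of 2.0, the loop stops after the third round.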

After obtaining the target gaussian mixture model, the electronic device may provide a voiceprint recognition service for the target user by means of the target gaussian mixture model and the initially obtained general background model.

Referring to fig. 3, which is a schematic diagram of a training process of voiceprint features provided in an embodiment of the present application, as shown in fig. 3, an electronic device may first train a gaussian mixture model (GMM model) as a universal background model (UBM model) through sample audio data of a non-target user. Further, entering a registration process of the target user, the electronic device may train on the basis of the gaussian mixture model through the registered audio data of the target user, and update the model parameters to obtain an updated gaussian mixture model. The electronic device can judge whether the model parameters of the updated Gaussian mixture model are different from the model parameters of the initially acquired general background model obviously based on the significance difference algorithm. On the one hand, if the difference is significant, the updated gaussian mixture model can be used as the target gaussian mixture model corresponding to the target user. On the other hand, if the difference is not significant, a new registration process can be performed to obtain new registration audio data of the target user, so that training is continued on the basis of the updated gaussian mixture model through the registration audio data (all the obtained registration audio data) of the target user, and updating is further performed until the updated model parameters and the initially obtained model parameters of the general background model have significant differences.

In an embodiment, referring to fig. 4, which is a flowchart illustrating a voiceprint recognition method provided in an embodiment of the present application, as shown in fig. 4, after obtaining a target gaussian mixture model, an electronic device may obtain test audio data of a user to be recognized. Here, the test audio data is audio data collected when a voiceprint recognition service is provided. The electronic device may extract test audio features from the test audio data.

The electronic device may calculate the test audio feature according to the target gaussian mixture model to obtain a first probability value corresponding to the test audio feature. The electronic device may calculate the test audio feature according to the initially obtained general background model, and obtain a second probability value corresponding to the test audio feature.

The electronic device can calculate a difference between the first probability value and the second probability value and determine whether the difference is greater than a difference threshold. Here, the difference threshold may be set as needed. In one case, if the difference is greater than the difference threshold, it may be determined that the voiceprint feature of the target user in the test audio features is sufficiently significant, and thus, it may be determined that the user to be identified is the target user. In another case, if the difference is not greater than the difference threshold, it may be determined that the voiceprint characteristics of the target user in the test audio characteristics are not significant enough, and it may be determined that the user to be identified is not the target user.
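The decision above can be sketched as a comparison of two scores. As an assumption, the sketch uses the average per-frame log-likelihood under each model as the "probability value", so that the difference is a log-likelihood ratio; the present application only speaks of a first and a second probability value. The function names and the default threshold are illustrative.

```python
import numpy as np

def avg_log_likelihood(features, weights, means, variances):
    """Average per-frame log-likelihood of features under a diagonal GMM."""
    diff = features[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    # Log-sum-exp over components for numerical stability.
    peak = log_comp.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(np.mean(frame_ll))

def is_target_user(test_features, target_gmm, ubm, threshold=0.0):
    """Accept when the target model explains the test audio better than the UBM.

    target_gmm and ubm are (weights, means, variances) tuples.
    """
    first = avg_log_likelihood(test_features, *target_gmm)    # target model score
    second = avg_log_likelihood(test_features, *ubm)          # background score
    return (first - second) > threshold
```

Raising the threshold makes acceptance stricter; as stated above, its value may be set as needed.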

According to the scheme, in the process of registering the audio data of the target user, whether a new registration process is performed or not can be determined according to the difference of the model parameters fed back in real time, so that the self-adaption of the registration times is realized, the phenomenon that the voiceprint characteristics of the target user cannot be accurately represented by a target Gaussian mixture model due to too few registration times is avoided, and the voiceprint recognition rate in the voiceprint recognition process is improved; in addition, compared with a registration mode with preset registration times in a related scheme, the method avoids excessive occupation of memory resources and power consumption resources caused by excessive registration times, and can effectively improve the application effect of a low-power-consumption voice recognition chip.

Fig. 5 is a block diagram of a voiceprint recognition apparatus with adaptive registration times according to an embodiment of the present invention, and as shown in fig. 5, the apparatus may include:

a first obtaining module 510, configured to obtain a gaussian mixture model as a general background model;

a second obtaining module 520, configured to obtain registered audio data of a target user, and construct a training sample based on the registered audio data;

an updating module 530, configured to update the model parameters of the gaussian mixture model with the training sample according to a maximum posterior probability estimation algorithm, so as to obtain an updated gaussian mixture model;

a judging module 540, configured to judge, based on a significance difference algorithm, whether a model parameter of the updated gaussian mixture model during the current registration is significantly different from a model parameter of the initially acquired general background model;

and a determining module 550, configured to determine whether to execute a next registration procedure according to the determination result.

In one embodiment, the apparatus further comprises:

a third obtaining module 560, configured to obtain sample audio data of multiple non-target users, and construct multiple training samples;

and the training module 570 is configured to train the initial Gaussian mixture model with the multiple training samples according to an expectation-maximization (EM) algorithm, so as to obtain a Gaussian mixture model serving as a general background model.

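In an embodiment, the determining module 550 is further configured to:

if the judgment result indicates that no significant difference exists, determine to execute the next round of registration process, and return to the step of acquiring registered audio data of the target user and constructing a training sample based on the registered audio data.

In an embodiment, the second obtaining module 520 is further configured to:

acquire registered audio data of a current registration flow, and splice the registered audio data of the current registration flow with all registered audio data of a historical registration flow to obtain spliced audio data; the historical registration process is a registration process preceding the current registration process;

and extract audio features from the spliced audio data as a training sample of the current registration process.

In an embodiment, the determining module 550 is further configured to:

if the judgment result indicates that a significant difference exists, determine that the next round of registration process does not need to be executed, and take the updated Gaussian mixture model in the current registration process as the target Gaussian mixture model corresponding to the target user.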

In one embodiment, the apparatus further comprises:

the identification module 580 is configured to obtain test audio data of a user to be identified, and extract a test audio feature from the test audio data; calculating a first probability value corresponding to the test audio feature according to the target Gaussian mixture model; calculating a second probability value corresponding to the test audio characteristic according to the initially acquired general background model; and judging whether the difference value between the first probability value and the second probability value is greater than a preset difference value threshold value, and if so, determining the user to be identified as the target user.

In an embodiment, the identifying module 580 is further configured to:

and if the difference value between the first probability value and the second probability value is not greater than the difference threshold value, determining that the user to be identified is not the target user.

For the implementation of the functions and actions of each module in the apparatus, refer to the implementation of the corresponding steps in the voiceprint recognition method with adaptive registration times described above; details are not repeated here.

In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

Claims

1. A registration frequency self-adaptive voiceprint recognition method is characterized by comprising the following steps:

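acquiring a Gaussian mixture model serving as a general background model;

acquiring registered audio data of a target user, and constructing a training sample based on the registered audio data;

updating the model parameters of the Gaussian mixture model by the training sample according to a maximum posterior probability estimation algorithm to obtain an updated Gaussian mixture model;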

judging whether the model parameters of the updated Gaussian mixture model during the current registration have significant differences compared with the initially acquired model parameters of the general background model based on a significant difference algorithm;

and determining whether to execute the next registration process according to the judgment result.

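2. The method of claim 1, wherein prior to said obtaining the Gaussian mixture model as the general background model, the method further comprises:

acquiring sample audio data of a plurality of non-target users, and constructing a plurality of training samples;

and training the initial Gaussian mixture model with the plurality of training samples according to an expectation-maximization (EM) algorithm to obtain the Gaussian mixture model serving as a general background model.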

3. The method according to claim 1, wherein the determining whether to perform the next round of registration procedure according to the determination result comprises:

and if the judgment result indicates that no significant difference exists, determining to execute the next round of registration process, returning to the step of acquiring the registration audio data of the target user, and constructing a training sample based on the registration audio data.

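4. The method of claim 3, wherein the obtaining registered audio data of the target user, and the constructing training samples based on the registered audio data comprises:

acquiring registered audio data of a current registration flow, and splicing the registered audio data of the current registration flow with all registered audio data of a historical registration flow to obtain spliced audio data; the historical registration process is a previous registration process of the current registration process;

and extracting audio features from the spliced audio data to be used as a training sample of the current registration process.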

5. The method according to claim 1, wherein the determining whether to perform the next round of registration procedure according to the determination result comprises:

and if the judgment result indicates that the significant difference exists, determining that the next registration process is not required to be executed, and taking the updated Gaussian mixture model in the current registration process as the target Gaussian mixture model corresponding to the target user.

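6. The method of claim 5, further comprising:

acquiring test audio data of a user to be identified, and extracting test audio features from the test audio data;

calculating a first probability value corresponding to the test audio feature according to the target Gaussian mixture model;

calculating a second probability value corresponding to the test audio feature according to the initially acquired general background model;

and judging whether the difference value between the first probability value and the second probability value is greater than a preset difference threshold value, and if so, determining that the user to be identified is the target user.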

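7. The method of claim 6, further comprising:

if the difference value between the first probability value and the second probability value is not greater than the difference threshold value, determining that the user to be identified is not the target user.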

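8. A voiceprint recognition apparatus with an adaptive number of registrations, comprising:

a first acquisition module, configured to acquire a Gaussian mixture model serving as a general background model;

a second acquisition module, configured to acquire registered audio data of a target user and construct a training sample based on the registered audio data;

an updating module, configured to update the model parameters of the Gaussian mixture model by the training sample according to a maximum posterior probability estimation algorithm to obtain an updated Gaussian mixture model;

a judging module, configured to judge, based on a significant difference algorithm, whether the model parameters of the updated Gaussian mixture model during the current registration are significantly different from the initially acquired model parameters of the general background model;

and a determining module, configured to determine whether to execute the next registration process according to the judgment result.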

9. An electronic device, characterized in that the electronic device comprises:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the registration number adaptive voiceprint recognition method of any of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the registration number adaptive voiceprint recognition method of any one of claims 1 to 7.