WO2020001182A1

WO2020001182A1 - Voiceprint recognition method, electronic device, and computer readable storage medium

Info

Publication number: WO2020001182A1
Application number: PCT/CN2019/086767
Authority: WO
Inventors: 郑能恒; 林�吉
Original assignee: 深圳大学
Priority date: 2018-06-28
Filing date: 2019-05-14
Publication date: 2020-01-02
Also published as: CN108831487B; CN108831487A

Abstract

A voiceprint recognition method, an electronic device, and a computer readable storage medium. The voiceprint recognition method comprises: acquiring voice data to be analyzed; extracting a change factor feature in the voice data; using an fallible point classifier to perform miscalculation classification on the voice data according to the change factor feature so as to obtain relative miscalculation probabilities of voice data miscalculation in K subsystems; determining the offset of the relative miscalculation probability corresponding to any subsystem with respect to the average relative miscalculation probability of the K subsystems, and calculating a final fusion weight of the corresponding subsystem according to the offset; and weighting the recognition result of each subsystem by means of the final fusion weight, and obtaining a comprehensive recognition result of the voice data according to the recognition result of each subsystem after weighting.

Description

Voiceprint recognition method, electronic device and computer-readable storage medium

The present application relates to the field of electronic technology, and in particular, to a voiceprint recognition method, an electronic device, and a computer-readable storage medium. Background technique

With the popularity of smart devices and related hardware facilities, voice interaction has become an integral part of human-computer interaction. There are more and more application scenarios related to voiceprints in voice interactions, including but not limited to: voiceprint time and attendance, software login, bank transfer and account opening verification, wake-up of virtual voice assistant, personalized interaction for different user groups Etc., these systems all use voiceprints without exception. The so-called voiceprint is the unique sound characteristic of everyone. In real life, everyone's voice when speaking has its own characteristics. Generally speaking, voiceprint recognition is divided into the following types: emotion recognition, age recognition, language recognition, gender recognition, speaker recognition, and so on.

In the prior art, in order to improve the accuracy of voiceprint recognition, multiple types of voiceprint systems are mostly used for mashups, and these systems are assigned different weights on the score domain for weighted fusion to obtain the final judgment result. For example, a fusion strategy using linear logistic regression: The central idea of this strategy is to normalize the score of each subsystem to an interval for a hybrid system with N subsystems, and then use the development set to train the fusion of each subsystem i Weights while training an overall offset!%,

Due to the complexity of the actual situation, different types of recognition subsystems in the prior art may not necessarily adapt to the initially set weights. Therefore, a fixed weight method is used to make the accuracy of voiceprint recognition low. technical problem

The embodiments of the present application provide a voiceprint recognition method, an electronic device, and a computer-readable storage medium, which are used to improve the accuracy of voiceprint recognition by setting appropriate voiceprint recognition weights. Technical solutions

A first aspect of the embodiments of the present application provides a voiceprint recognition method, including:

Get ear and voice data to be analyzed;

1

Replacement page (Article 26) Extracting a change factor feature in the voice data, where the change factor feature is used to characterize comprehensive information related to the voice data, the comprehensive information includes at least sound transmission channel information, sound environment information, and sounding object information;

Misclassifying the voice data according to the characteristics of the change factor through an error-prone classifier to obtain the relative misjudgement probability that the voice data is misjudged in the K subsystems;

Determining an offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculating the final fusion weight of the corresponding subsystem according to the offset;

Acquiring the recognition results of the voice data by each subsystem;

A power P weight is applied to the recognition results of the respective subsystems through the final fusion weights, and a comprehensive recognition result of the voice data is obtained according to the recognition results of the respective subsystems after weighting.

Optionally, the method for training the error-prone classifier includes:

The short-term speech data set is used as the test data set of each subsystem, and all misjudged speech segments in the test process are labeled as N different labels according to different subsystems. As a training database, N is an integer greater than zero. ;

Extracting MFCC Mel frequency cepstrum coefficient characteristics for each piece of short-term speech data in the training database;

Training a general background model according to the extracted MFCC features, training an overall change matrix; obtaining a change factor feature of the short-term speech data according to the overall change matrix;

According to the characteristics of the change factors and their corresponding labels, an error-prone classifier capable of performing N-class classification is trained.

Optionally, before training the error-prone point classifier capable of N-class classification according to the change factor feature and its corresponding label, the method includes:

Linear discriminant analysis is used to perform channel compensation on the characteristics of the change factor to obtain the characteristics of the change factor after the dimensionality reduction.

Optionally, the sum of the relative misjudgment probabilities corresponding to the K subsystems is one.

Optionally, the calculating the final fusion weight of the corresponding subsystem according to the offset includes: specifically passing the following formula:

2

Replacement page (Article 26)

The q is the final fusion weight of each of the K subsystems, where x is the input voice, and 0 ^ is the initial fusion weight of each subsystem when the input voice is%, and H is the relationship of q coefficient.

A second aspect of the embodiments of the present application provides another electronic device, including:

K subsystems and dynamic weight sub-modules, where K is an integer greater than zero;

The dynamic weighting sub-module is configured to obtain voice data to be analyzed; and extract a change factor feature in the voice data, where the change factor feature is used to characterize comprehensive information related to the voice data, where the comprehensive information includes at least a voice Transmission channel information, sound environment information, and vocalization object information; misclassified the voice data according to the change factor characteristics through a error-prone classifier, and obtained that the voice data was misjudged in the K subsystems Determine the offset of the relative misjudgment probability corresponding to any subsystem and the average relative misjudged probability of the K subsystems, and calculate the final fusion weight of the corresponding subsystem according to the offset; Weight the recognition results of the respective subsystems by using the final fusion weight, and obtain a comprehensive recognition result of the voice data according to the recognition results of the respective subsystems after weighting;

The subsystem is configured to perform preliminary voiceprint recognition on the voice data, and obtain a recognition result of the voice data.

Optionally, the dynamic weight sub-module includes: a feature extraction unit, an error-prone classifier, a weight calculation unit, and a comprehensive calculation unit;

The feature extraction unit is configured to extract a change factor feature in the voice data;

The error-prone point classifier is configured to misclassify the voice data according to the characteristics of the change factor, and obtain a relative misjudgement probability that the voice data is misjudged in the K subsystems;

The weight calculation unit is configured to determine an offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudged probability of the K subsystems, and calculate the final fusion weight of the corresponding subsystem according to the offset ;

3

Replacement page (Article 26) The comprehensive calculation unit is configured to weight the recognition results of the respective subsystems by using the final fusion weight, and obtain the comprehensive recognition results of the voice data according to the recognition results of the subsystems after weighting.

Optionally, the weight calculation unit is further configured to:

The calculating the final fusion weight of the corresponding subsystem according to the offset includes:

Specifically through the following formula:

Weights,

The q is the final fusion weight of each of the K subsystems, where X is the input voice,

The P is a relationship coefficient of the q.

A third aspect of the embodiments of the present application provides another electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, To implement the voiceprint recognition method provided in the first aspect of the embodiment of the present application.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the voiceprint recognition method provided by the first aspect of the embodiments of the present application is implemented. Beneficial effect

As can be seen from the above, the scheme of this application classifies the speech segments with high error rates of each subsystem according to the characteristics of the change factors, classifies them into K-type error prone points, trains a corresponding classification model, and then classifies each piece of speech data to be analyzed , Reducing the prediction weight of the subsystem corresponding to the label obtained by the classification, and then optimizing the final result, achieving the effects of real-time evaluation and dynamic adjustment of the false positive rate of each subsystem.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1-a is a schematic flowchart of a voiceprint recognition method provided by an embodiment of the present application;

1-b is a structural diagram of a voiceprint recognition system provided by an embodiment of the present application;

4

Replacement page (Article 26) FIG. 1c is a schematic flowchart of a method for training a prone classifier according to an embodiment of the present application; FIG. 1-d is an operation flowchart of a dynamic weight sub-module provided by an embodiment of the present application;

2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a hardware structure of an electronic device according to another embodiment of the present application. Embodiments of the invention

In order to make the object, features, and advantages of the present application more obvious and easier to understand, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described The embodiments are only a part of the embodiments of this application, but not all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without making creative work fall into the protection scope of this application.

Example one

An embodiment of the present application provides a voiceprint recognition method. Please refer to FIG. 1-a. The voiceprint recognition method mainly includes the following steps:

101. Acquire voice data to be analyzed.

The embodiment of the present invention is applied to a voiceprint recognition system, where the voiceprint recognition system includes K subsystems, and K is an integer greater than zero. The system architecture of the voiceprint recognition system according to the embodiment of the present invention may refer to the figure.

1-b _o

Each subsystem in the voiceprint recognition system may correspond to different types of voiceprint recognition, and the types of voiceprint recognition include: emotion recognition, age recognition, and language recognition. Further, each subsystem may also correspond to each sub-category in a recognition scenario. For example, in speech recognition, a subsystem corresponds to a language (such as Chinese, English, or French, etc.). It can be understood that, in practical applications, the correspondence between the subsystem and the voiceprint recognition category can be determined according to the actual situation, which is not specifically limited here.

In the embodiment of the present invention, the voiceprint recognition method is mainly applied to the dynamic weight sub-module in the system architecture, that is, the voice data to be analyzed can be input into the dynamic weight sub-module for weight analysis first.

102. Extract a change factor feature in the voice data.

The dynamic weighting sub-module extracts a change factor feature in the voice data, and the change factor feature is used to characterize comprehensive information related to the voice data, and the comprehensive information includes at least sound transmission channel information, sound environment information, and sounding object information. .

5

Replacement page (Article 26) Exemplarily, the feature of ivector (identity vector) in the construction of the change factor feature model is described. The ivector features a large amount of information of the speaking object, such as transmission channel information, acoustic environment information, and speaker information.

103. Misjudge and classify the voice data according to the characteristics of the change factor through an error-prone point classifier;

The dynamic weighting sub-module classifies the voice data according to the characteristics of the change factor by using the error-prone classifier, and obtains the relative misjudgment probability that the voice data is misjudged in the K subsystems.

For example, the classification result output by the error-prone classifier can be shown in the following table:

Where x is the input speech data, P _f (S _t | x) ( _/ = 1, 2, ..., is the relative misjudgment probability judged by the system 5 under the condition that the input speech is x, the higher the value It means that the probability that the voice is misjudged (including false acceptance / false rejection) under the corresponding subsystem is higher. And the sum of the relative misjudgment probabilities of all subsystems is 1, that is:

jy _f (S ^) = Ui = l, 2, ..., K)

When the probability of misjudgment for a certain voice of each subsystem is equal, the relative misjudgment probability of each subsystem is I E, which is the average relative misjudgment probability.

104. Determine an offset between a relative misjudgment probability corresponding to any subsystem and an average relative misjudged probability of the K subsystems, and calculate a final fusion weight of the corresponding subsystem according to the offset;

The dynamic weight sub-module determines an offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudged probability of the K subsystems, and calculates the final fusion weight of the corresponding subsystem according to the offset.

The true significance of the relative false positive probability lies in the offset between the relative false positive probability and the average false positive probability. Relative misjudgment probability represents the relative magnitude relationship between the misjudgment probabilities of different subsystems. For example, the relative misjudgment probability of subsystem a is 0.1, and the relative misjudgment probability of subsystem b is 0.5. The significance of the relative misjudgment probability is that the misjudgment probability of subsystem b is greater than that of subsystem a, rather than the system. The false positive probability of b is 0.5.

6

Replacement page (Article 26) For example, the definition of the offset is as follows:

The higher the relative misjudgment probability of a certain subsystem for a certain speech, the greater the offset, which means that the higher the probability of the misjudgment of this speech in a certain subsystem, the fusion weight of this subsystem should be reduced at this time. . Root

The central idea is to use the offset between the relative misjudgment probability and the average probability as the calculation parameter of the fusion weight. At the same time, in order to adjust the influence of the dynamic weight sub-modules when the final probability values are fused, the weight values can be fine-tuned by adjusting the standard deviation of the weight arrays while maintaining the relative size of the weight values.

105. Obtain a recognition result of the voice data by each subsystem;

The dynamic weighting sub-module obtains the recognition result of the voice data by each subsystem.

In the embodiment of the present invention, as shown in FIG. 1-b of the system architecture, the speech data to be analyzed can be input to each subsystem for voiceprint recognition, and the recognition results of each subsystem are obtained. Among them, step 105 and step 101 are two branches that can be executed in parallel, that is, there is no strict timing relationship between step 105 and step 101, that is, before step 106, step 101 may be performed first, or step 105 may be performed first. Step 105 and step 101 may be performed at the same time, which is not specifically limited herein.

106. Weight the recognition results of the respective subsystems according to the final fusion weight, and obtain a comprehensive recognition result of the voice data according to the weighted recognition results of each subsystem.

The voiceprint recognition system weights the recognition results of the respective subsystems by using the final fusion weight, and obtains the comprehensive recognition result of the voice data according to the recognition results of the subsystems after weighting.

Exemplarily, the relative misjudgment probability (^ | x) can be calculated as each subsystem after being calculated by the following function.

Coefficient.

7

Replacement page (Article 26) Among them, for any certain group A Rlx;), the relationship between n and n satisfies the following definitions:

① The smaller the H, the smaller the standard deviation of the array C, (£ = 1,2, K);

(2) The larger | i, the larger the standard deviation of the array C _i (i = 1,2, K);

Wait.

In a general hybrid system, the number of subsystems K is generally fixed, so K here can be regarded as a constant. It can be seen that as U increases or decreases, o also increases or decreases non-linearly. Under normal circumstances, the value defaults to 0 without adjustment. If adjustment is needed, it is recommended to control the adjustment range between [-1, 1]. Too large or too small may have an adverse effect on the final fusion score result. In addition, a large adjustment may lead to a negative probability value, but it does not affect the decision process of the fusion score.

The solution of this application classifies the speech segments with high error rates of each subsystem according to the characteristics of the change factors, classifies them into K-type error prone points, and trains the corresponding classification model, and then classifies each piece of speech data to be analyzed to reduce the classification results. The prediction weights of the subsystems corresponding to the tags are optimized, and the final result is optimized to achieve the effects of real-time evaluation and dynamic adjustment of the false positive rate of each subsystem.

Example two

In the embodiment of the present invention, a fault-prone classifier needs to be constructed. Referring to FIG. 1-c, the method includes:

201. Establish a training database;

The short-term speech data set is used as the test data set of each subsystem, and all misjudged speech segments in the test process are labeled as N different labels according to different subsystems. As a training database, N is an integer greater than zero. .

202. Extract the characteristics of the M FCC Mel frequency cepstrum coefficient;

For each short-term speech data in the training database, extract a Mel frequency cepstrum coefficient

(Mel Frequency Cepstrum Coefficient, MFCC).

203. Train the overall change matrix;

Train the Universal Background Model (Replacement page) based on the extracted MFCC features (Details Article 26) UBM), training the overall change matrix T.

204. Obtain a change factor characteristic of the short-term speech data.

A change factor characteristic of the short-term speech data is obtained according to the overall change matrix T.

Exemplarily, the characteristics of the change factor can be obtained according to the following formula:

M = m + Tw;

Where m is the supervector of the background model, which is related to the acoustics and channel commonality of all speakers;

M is the mean supervector, which is obtained by adaptive training based on the supervector of the background model; T is the overall change matrix T; w is the feature factor vector of the change factor.

205. Perform dimension reduction processing on the characteristics of the change factor;

Linear discriminant analysis is used to perform channel compensation on the characteristics of the change factor to reduce the influence of redundant information such as channels in the characteristics of the change factor, and at the same time achieve the effect of reducing the dimension. Here, a linear discriminant analysis (LDA) dimensionality reduction method is used.

206. Train an error-prone classifier capable of performing N-class classification.

According to the characteristics of the change factors and their corresponding labels, an error-prone classifier capable of performing N-class classification is trained. Here, the svm classifier is used, and there are two schemes to choose from: one, a two-class svm using one vs rest strategy; two, a two-class svm using one vs one strategy.

The error-prone point classifier in the embodiment of the present invention can detect the misjudgment probability of different subsystems according to different application scenarios or voiceprint characteristics, make full use of the advantages of each subsystem and avoid high misdetection points, and then give more For proper fusion weights, the effectiveness of the hybrid system is maximized and the robustness is enhanced.

Example three

The embodiment of the present invention takes the hybrid system of language recognition as an example to describe in detail the voiceprint recognition method in the embodiment of the present invention, including:

1. For the architecture of the language identification hybrid system in the embodiment of the present invention, refer to FIG. 1-b, and each sub-system independently gives probability values of N different languages.

2. Let x be an input voice. The output of each subsystem is shown in the following table:

^ x) (/ = 1, 2, ..., A0 each subsystem independently gives a certain input voice belongs to a certain language

9

Replacement page (Article 26) Z _{j (} j = 1, 2, ..., AO probability, and the sum of all probabilities is also 1, that is:

3. The operation flow of executing the dynamic weight sub-module, please refer to Figure 1-d;

After extracting the ivector features of the speech data, the ivector features are input into the error-prone point classifier, and the classification results output by the error-prone point classifier can be shown in the following table:

Where x is the input speech data, P _f (S _t | x) ( _/ = 1, 2, ..., is the relative misjudgment probability of being misjudged by the system 5 under the condition that the input speech is x, the more the value is A high value indicates that the probability of the voice being misjudged (including false acceptance / false rejection) in the corresponding subsystem is higher. And the sum of the relative misjudgment probabilities of all subsystems is 1, that is, :

fP _f (S ₁ \ x) = Hi = \ 2, ..., K)

When the probability of misjudgment of each subsystem for a certain voice is equal, the relative misjudgment probability of each subsystem is the average relative misjudgment probability.

The true significance of the relative false positive probability lies in the offset between the relative false positive probability and the average false positive probability. The higher the relative misjudgment probability of a certain subsystem for a certain speech, the greater the offset, which means that the higher the probability of the misjudgment of this speech in a certain subsystem, the fusion weight of this subsystem should be reduced at this time . Based on the above ideas, as an example, the following calculation formula can be obtained:

The initial fusion weight for each subsystem when the input speech is ^. Central thinking

The offset between the average probabilities is used as a calculation parameter for the fusion weight. At the same time, in order to adjust the influence of the dynamic weight sub-modules when the final probability values are fused, the weight values can be fine-tuned by adjusting the standard deviation of the weight arrays while maintaining the relative size of the weight values.

In order to adjust the influence of the dynamic weight sub-modules when the final probability values are fused,

10

Replacement page (Article 26) While maintaining the relative size relationship, fine-tune the weight value by adjusting the standard deviation of the weight array.

System final fusion weights:

The q is the final fusion weight of each of the K subsystems, where x is the input voice, and 0 ^ is the initial fusion weight of each subsystem ^ when the input voice is%, and p is the q of the Relationship coefficient.

① The smaller the H, the smaller the standard deviation of the array C, (1 = 1,2, K);

③ When 11 = 0, the standard deviation of the array (i = l, 2, K) and the array (X | x) (i = l, 2, ..., K) are equal.

4. The fusion weight array c _£ ((= 1,2,, IT) is integrated into the final scoring matrix in the following form to obtain the language output by the hybrid system.

The first matrix on the left side of the above equation is a fusion weight matrix, the second matrix on the left side of the equation is a probability matrix of all languages given by the K subsystems, and the right matrix on the right side of the equation is The term matrix is assigned the fusion probability matrix after the fusion weights. Finally, add each column in the matrix on the right side of the equation to get the probability that the speech is in each language:

11

Replacement page (Article 26)

Example 4

Please refer to FIG. 2, which provides an electronic device according to an embodiment of the present application. The electronic device can be used to implement the voiceprint recognition method provided by the embodiment shown in FIG. 1-a. As shown in FIG. 2, the electronic device mainly includes:

The dynamic weighting sub-module 210 is configured to obtain voice data to be analyzed; extract a change factor feature in the voice data, and the change factor feature is used to characterize comprehensive information related to the voice data, where the comprehensive information includes The information of the sound transmission channel, the sound environment information and the sounding object information; the error data is misclassified and classified according to the characteristics of the change factor through the error-prone classifier, and the speech data is obtained in the K subsystems 220 Relative misjudgment probability of misjudgment; determining an offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculating the final fusion of the corresponding subsystem according to the offset Weighting; weighting the recognition results of the respective subsystems through the final fusion weights, and obtaining a comprehensive recognition result of the voice data according to the weighted recognition results of each subsystem;

The subsystem 220 is configured to perform preliminary voiceprint recognition on the voice data, and obtain a recognition result of the voice data.

The weight calculation unit is configured to determine an offset between a relative misjudgment probability corresponding to any subsystem and an average relative misjudgment probability of the K subsystems, and calculate a final value of the corresponding subsystem according to the offset.

12

Replacement page (Article 26) Fusion weights;

The comprehensive calculation unit is configured to weight the recognition results of the respective subsystems by using the final fusion weight, and obtain the comprehensive recognition results of the voice data according to the recognition results of the subsystems after weighting.

Optionally, the weight calculation unit is further configured to:

Calculate the comprehensive recognition result of the voice data according to the initial fusion weight and the following formula;

The q is the final fusion weight of each of the K subsystems, where x is the input speech, and 0 ^ is the initial fusion weight of each subsystem when the input speech is x, and H is the relationship of q coefficient.

It should be noted that, in the embodiment of the electronic device illustrated in FIG. 2 above, the division of each functional module is merely an example. In actual applications, according to requirements, such as the configuration requirements of the corresponding hardware or the convenience of software implementation, the The above function allocation is completed by different function modules, that is, the internal structure of the electronic device is divided into different function modules to complete all or part of the functions described above. Moreover, in actual applications, the corresponding functional modules in this embodiment may be implemented by corresponding hardware, or may be completed by corresponding hardware executing corresponding software. The embodiments described in this specification can apply the above-mentioned description principles, which will not be described in detail below.

For the specific process of each function module in the electronic device provided in this embodiment to implement their functions, please refer to the specific content described in the embodiment shown in FIG. 1-a above, which will not be repeated here.

Example 5

An embodiment of the present application provides an electronic device. Referring to FIG. 3, the electronic device includes: a memory 301, a processor 302, and a computer stored in the memory 301 and operable on the processor 302.

13

Replacement page (Article 26) A computer program, when the processor 302 executes the computer program, implements the voiceprint recognition method described in the foregoing embodiment shown in FIG. 1-a.

Further, the electronic device further includes:

At least one input device 303 and at least one output device 304.

The memory 301, the processor 302, the input device 303, and the output device 304 are connected through a bus 305.

The input device 303 may be a camera, a touch panel, a physical button, a mouse, or the like. The output device 304 may be a display screen.

The memory 301 may be a high-speed random access memory (RAM, Random Access Memory) memory, or may be a non-volatile memory (non-volatile memory), such as a disk memory. The memory 301 is configured to store a set of executable program code, and the processor 302 is combined with the memory 301.

Further, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium may be provided in the electronic device in the foregoing embodiments, and the computer-readable storage medium may be the foregoing FIG. 3 The memory in the embodiment is shown. A computer program is stored on the computer-readable storage medium, and when the program is executed by a processor, the voiceprint recognition method described in the foregoing embodiment shown in FIG. 1-a is implemented. Further, the computer-storable medium may also be various media that can store program codes, such as a U disk, a mobile hard disk, a read-only memory (ROM, Read-Only Memory), a RAM, a magnetic disk, or an optical disc.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are only schematic. For example, the division of the modules is only a logical function division. In actual implementation, there may be another division manner. For example, multiple modules or components may be combined or It can be integrated into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct connection or communication connection may be through some interfaces, indirect coupling or communication connection of devices or modules, and may be electrical, mechanical, or other forms.

The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on multiple network modules. Some or all of these modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

14

Replacement page (Article 26) In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist separately physically, or two or more modules may be integrated into one module. The above integrated modules can be implemented in the form of hardware or software functional modules.

When the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application is essentially a part that contributes to the existing technology or all or part of the technical solution may be embodied in the form of a software product, which is stored in a readable storage The medium includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned readable storage media include: U-disks, mobile hard disks, ROMs, RAMs, magnetic disks, or optical disks, which can store program codes.

It should be noted that, for the foregoing method embodiments, for simplicity of description, they are all described as a series of action combinations, but those skilled in the art should know that this application is not limited by the described action order. Because according to the present application, certain steps may be performed in another order or simultaneously. Secondly, a person skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required for this application.

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related description of other embodiments.

The above is the description of the voiceprint recognition method, the electronic device, and the computer-readable storage medium provided by the present application. For those skilled in the art, according to the ideas of the embodiments of the present application, the specific implementation and application range will be changed. In summary, the content of this specification should not be construed as a limitation on this application.

15

Replacement page (Article 26)

Claims

Claim

1. A voiceprint recognition method applied to a voiceprint recognition system, said voiceprint recognition system having K subsystems, said K being an integer greater than zero, characterized by comprising:

Get ear and voice data to be analyzed;

Extracting a change factor feature in the voice data, the change factor feature is used to represent comprehensive information related to the voice data, and the comprehensive information includes at least sound channel information, sound environment information, and sounding object information;

Misclassifying the voice data according to the characteristics of the change factor through the error-prone classifier to obtain the relative misjudgement probability that the voice data is misjudged in the K subsystems;

Determining an offset amount between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgement probability of the K subsystems, and calculating the final fusion weight of the corresponding subsystem according to the offset amount;

Acquiring the recognition results of the voice data by each subsystem;

The recognition result of each subsystem is weighted by the final fusion weight, and the comprehensive recognition result of the voice data is obtained according to the recognition result of each subsystem after weighting.

2. The method according to claim 1, wherein the method for training the error-prone classifier comprises:

The short-term speech data set is used as the test data set of each subsystem, and all misjudged speech segments in the test process are labeled as N different labels according to different subsystems as a training database, where N is an integer greater than zero ;

For each short-term speech data in the training database, extract the MFCC Mel frequency cepstrum coefficient feature;

Training a general background model based on the extracted MFCC features, and training an overall change matrix; obtaining a change factor feature of the short-term speech data according to the overall change matrix; training can perform N based on the change factor feature and its corresponding label Category classification

17

Replacement page (Article 26) Error-prone classifier.

3. The method according to claim 2, wherein before training the error-prone classifier capable of N-class classification according to the characteristics of the change factor and its corresponding label, comprises:

Linear discriminant analysis is used to perform channel compensation on the characteristics of the change factor to obtain the characteristics of the change factor after dimensionality reduction.

4. The method according to claim 1, wherein the sum of the relative misjudgment probabilities corresponding to the K subsystems is one.

5. The method according to claim 1, wherein:

Calculating the final fusion weight of the corresponding subsystem according to the offset includes: calculating the initial fusion weight of the corresponding subsystem according to the offset, specifically through the following:

According to the initial fusion weight, and calculate the final fusion weight by the following formula; Q = Qs _{t, x} + ^ [Qs,, x-go] ( ^Z = 1, ² ,,)

The q is the final fusion weight of each of the K subsystems, where x is the output

The initial fusion weight of each subsystem at x, where the relationship coefficient is described.

6. —A voiceprint recognition system, characterized in that it includes:

K subsystems and dynamic weight sub-modules, where K is an integer greater than zero; the dynamic weight sub-modules are used to obtain speech data to be analyzed; extracting change factor characteristics, the change factor characteristics in the speech data It is used to characterize comprehensive information related to the voice data, where the comprehensive information includes at least sound transmission channel information, sound environment information, and vocalization object information; the error data is processed by the error-prone classifier according to the change factor characteristics. Misjudgment classification to obtain the voice data in the K subsystems

18

Replacement page (Article 26) The relative misjudgment probability of being misjudged in the system; determining the offset between the relative misjudgement probability corresponding to any subsystem and the average relative misjudgement probability of the K subsystems, and calculating the corresponding subsystem's A final fusion weight; weighting the recognition results of the respective subsystems by the final fusion weights, and obtaining a comprehensive recognition result of the voice data according to the recognition results of the subsystems after weighting;

The subsystem is configured to perform preliminary voiceprint recognition on the voice data and obtain a recognition result of the voice data.

7. The system according to claim 6, wherein the dynamic weight sub-module comprises: a feature extraction unit, an error-prone classifier, a weight calculation unit, and a comprehensive calculation unit;

The feature extraction unit is configured to extract a change factor feature in the voice data; the error-prone point classifier is configured to misclassify the voice data according to the change factor feature, and obtain the voice data in the voice data. Relative misjudgment probability of being misjudged in K subsystems;

The comprehensive calculation unit is configured to weight the recognition results of the respective subsystems by using the final fusion weight, and obtain the comprehensive recognition results of the voice data according to the recognition results of the subsystems after the weighting.

The system according to claim 6, wherein the weight calculation unit is further configured to:

5 _# initial fusion weight,

19

Replacement page (Article 26) 丄-A ($ | x) represents the offset;

K

;

Let H be the relation coefficient.

9. An electronic device comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that when the processor executes the computer program, rights are realized The method according to any one of claims 1 to 4.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that when the computer program is executed by a processor, the method according to any one of claims 1 to 4 is implemented.

20

Replacement page (Article 26)