CN113421573B - Identity recognition model training method, identity recognition method and device

Identity recognition model training method, identity recognition method and device

Info

Publication number: CN113421573B
Authority: CN (China)
Prior art keywords: training, recognition model, vector, voiceprint, inputting
Legal status: Active
Application number: CN202110681339.2A
Other languages: Chinese (zh)
Other versions: CN113421573A (en)
Inventors: 孟庆林, 蒋宁, 吴海英, 王洪斌, 刘敏, 陈燕丽
Current Assignee: Mashang Xiaofei Finance Co Ltd
Original Assignee: Mashang Xiaofei Finance Co Ltd
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202110681339.2A
Publication of CN113421573A
Application granted
Publication of CN113421573B

Classifications

    • G10L 17/04: Speaker identification or verification; training, enrolment or model building
    • G06F 18/24: Pattern recognition; classification techniques
    • G06F 21/32: User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G10L 17/00: Speaker identification or verification

Abstract

The embodiment of the application provides an identity recognition model training method, an identity recognition method, and an identity recognition device. The training method comprises: acquiring a training audio data set and performing feature extraction on it to obtain a training feature set; inputting the training feature set into a content recognition model included in a model to be trained for iterative training and, after training is completed, inputting the training feature set into the trained content recognition model to output a content vector; inputting the training feature set into a voiceprint recognition model included in the model to be trained for iterative training and, after training is completed, inputting the training feature set into the trained voiceprint recognition model to output a voiceprint vector; and inputting the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximum and its parameters converge, to obtain the identity recognition model. By adopting the embodiment of the application, the accuracy of the identity recognition result can be improved.

Description

Identity recognition model training method, identity recognition method and device
Technical Field
The embodiment of the application relates to the technical field of machine learning, in particular to an identity recognition model training method, an identity recognition method and an identity recognition device.
Background
With the rapid development of science and technology, more and more repetitive manual labor is gradually being replaced by artificial intelligence. Machine learning, an important branch of artificial intelligence, has been widely applied in fields such as machine translation, artificial-intelligence customer service, text detection, and voiceprint wake-up.
In the voiceprint wake-up field, identity recognition can confirm whether a user to be identified is a pre-registered user, and operations such as waking up a terminal device, or an application program deployed on the terminal device, can then be performed according to the identity recognition result. In the prior art, a voiceprint recognition algorithm can be used to recognize voiceprint information and thereby obtain an identity recognition result.
However, in the existing voiceprint recognition algorithm, the processing procedure is generally as follows: an acquired text-dependent voiceprint vector sequence is compressed to obtain a voiceprint codebook set; then, based on the voiceprint information of a user received in a text-dependent voiceprint recognition scene, the Euclidean distance between that voiceprint information and the voiceprint codebook set is determined; and the identity recognition result of the user is determined from the Euclidean distance. Because the identity recognition result is determined through only the single dimension of the user's voiceprint information, the recognition accuracy of the identity recognition result is reduced, which may further cause the terminal device or related application programs to be woken up by mistake.
Disclosure of Invention
The embodiment of the application provides an identity recognition model training method, an identity recognition method and an identity recognition device, so as to improve the recognition accuracy of an identity recognition result.
In a first aspect, an embodiment of the present application provides a method for training an identification model, where the method includes:
acquiring a training audio data set, and extracting features of the training audio data set to obtain a training feature set;
inputting the training feature set into a content recognition model included in a model to be trained for iterative training, and inputting the training feature set into the content recognition model after training is completed to output a content vector; inputting the training feature set into a voiceprint recognition model included in the model to be trained for iterative training, and inputting the training feature set into the voiceprint recognition model after training is completed to output a voiceprint vector;
and inputting the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximum and parameters are converged, so as to obtain an identity recognition model.
It can be seen that in the embodiment of the application, the trained identity recognition model comprises a content recognition model and a voiceprint recognition model, so that two dimensions of information, content and voiceprint, are considered simultaneously, which improves the accuracy of subsequent identity recognition. In addition, the identity recognition model further comprises a classifier; the data used to train the classifier are extracted by the trained content recognition model and voiceprint recognition model, and since the data extracted by trained models have high accuracy, the accuracy of the classifier trained on these data is improved, which further improves the accuracy of the identity recognition model and, finally, of subsequent identity recognition.
In a second aspect, an embodiment of the present application provides an identification method, where the method includes:
acquiring first voice data of a user to be identified;
inputting the first voice data into a content recognition model included in an identity recognition model, and outputting a target content vector; inputting the first voice data into a voiceprint recognition model included in the identity recognition model, and outputting a target voiceprint vector;
inputting the target content vector, the target voiceprint vector, a preset content vector and a preset voiceprint vector into a classifier included in the identity recognition model, and outputting a likelihood distribution value; the preset content vector and the preset voiceprint vector are obtained by inputting second voice data of the target user into the content recognition model and the voiceprint recognition model, respectively;
and under the condition that the likelihood distribution value is larger than a preset likelihood distribution value threshold value, determining that the user to be identified and the target user are the same user.
It can be seen that in the embodiment of the application, content information and voiceprint information are considered together during identity recognition, so that more factors are taken into account, recognition no longer relies on a single type of information, and the accuracy of identity recognition is improved.
In a third aspect, an embodiment of the present application provides an identification model training apparatus, where the apparatus includes:
the first acquisition module is used for acquiring a training audio data set and extracting features of the training audio data set to obtain a training feature set;
the first processing module is used for inputting the training feature set into a content recognition model included in a model to be trained to carry out iterative training, and inputting the training feature set into the content recognition model after training is completed to output a content vector; inputting the training feature set into a voiceprint recognition model included in the model to be trained for iterative training, and inputting the training feature set into the voiceprint recognition model after training is completed to output a voiceprint vector;
the first processing module is further configured to input the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training, until the likelihood of the classifier is maximum and the parameters converge, to obtain an identity recognition model.
In a fourth aspect, an embodiment of the present application provides an identification device, where the device includes:
the second acquisition module is used for acquiring first voice data of the user to be identified;
The second processing module is used for inputting the first voice data into a content recognition model included in the identity recognition model and outputting a target content vector; inputting the first voice data into a voiceprint recognition model included in the identity recognition model, and outputting a target voiceprint vector;
the second processing module is further configured to input the target content vector, the target voiceprint vector, the preset content vector and the preset voiceprint vector into a classifier included in the identity recognition model, and output a likelihood distribution value; the preset content vector and the preset voiceprint vector are obtained by inputting second voice data of the target user into the content recognition model and the voiceprint recognition model, respectively;
the second processing module is further configured to determine that the user to be identified and the target user are the same user when the likelihood distribution value is greater than a preset likelihood distribution value threshold.
In a fifth aspect, embodiments of the present application provide an electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
The at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the identity recognition model training method of any one of the first aspect, or the identity recognition method of any one of the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer readable storage medium, where computer executable instructions are stored, and when executed by a processor, implement the method for training an identification model according to any one of the first aspect or the method for identifying according to any one of the second aspect.
In a seventh aspect, embodiments of the present application provide a computer program product, including a computer program, which when executed by a processor implements the method for training an identification model according to the first aspect and the various possible designs of the first aspect, or the method for identifying an identification according to any one of the second aspects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the embodiments or in the description of the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment of an identity recognition model training method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an identity recognition model training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an architecture of an identity recognition model according to an embodiment of the present application;
FIG. 4 is a flowchart of an identity recognition model training method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of an identity recognition model training process according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an identity recognition method according to an embodiment of the present application;
FIG. 7 is a visual scene diagram of an identity recognition method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an identity recognition model training apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an identity recognition device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms "first," "second," "third," "fourth" and the like in the description and claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the present application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the development of science and technology, artificial intelligence has been widely applied in fields such as machine translation, artificial-intelligence customer service, text detection, and voiceprint wake-up. In the voiceprint wake-up field, identity recognition can confirm whether a user to be identified is a pre-registered user, and operations such as waking up a terminal device, or an application program deployed on the terminal device, can then be performed according to the identity recognition result. For example, in some application scenarios, a user may need to wake up a terminal device or an application deployed on it by voiceprint wake-up because both hands are occupied or the device is out of reach. For instance, a user needs to start payment application A while shopping, but is holding shopping bags with both hands, so it is inconvenient to start payment application A by touching the terminal device; the user can therefore start payment application A to make a payment by means of voiceprint wake-up.
In the prior art, a voiceprint recognition algorithm can be used to recognize voiceprint information and obtain an identity recognition result, and whether the terminal device or related application programs need to be woken up is then determined according to that result. The existing voiceprint recognition algorithm generally first compresses an acquired text-dependent voiceprint vector sequence to obtain a voiceprint codebook set, then determines, based on the voiceprint information of a user received in a text-dependent voiceprint recognition scene, the Euclidean distance between that voiceprint information and the voiceprint codebook set, and determines the identity recognition result of the user according to the Euclidean distance.
Continuing the foregoing example: when the user starts payment application A by voiceprint wake-up, the security requirement for starting is high because an actual transaction is involved; that is, the identity recognition result of the user must be determined first, and the application can be started only after it passes. The existing voiceprint recognition algorithm can obtain an identity recognition result through voiceprint recognition alone and thereby determine the identity of the user, but another problem then arises: if several application programs, or several payment application programs, are deployed on the terminal device, the terminal device, having determined only by voiceprint recognition that the user is a preset user, can start an application program but cannot determine which specific application program to start. An application program might be started according to a preset starting sequence, but it may not be a payment application program and thus cannot complete the payment function, or it may not be the specific payment application program the user wants to start; similar problems of failing to start applications accurately can occur in other application scenarios. In other words, determining the identity recognition result only from the user's voiceprint information reduces the recognition accuracy of the identity recognition result, and false wake-ups may occur, affecting the user experience.
In view of the above problems, in this application an identity recognition model is obtained by training on both the voiceprint information and the content information in the speech training set, and the trained identity recognition model then recognizes the voiceprint information and the content information in the voice information comprehensively. Judgment is therefore no longer made by relying on voiceprint information alone, which improves the recognition accuracy of the identity recognition result and reduces false wake-ups of terminal devices or related application programs.
Fig. 1 is a schematic diagram of an implementation environment of an identity recognition model training method according to an embodiment of the present application. As shown in fig. 1, the implementation environment provided by this embodiment may mainly include a server 101 and a terminal device 102, where the terminal device 102 communicates with the server 101 by wireless or wired means. In the wired mode, data may be transmitted between the terminal device 102 and the server 101 through a line such as a high-definition multimedia interface (HDMI); in the wireless mode, the terminal device 102 may communicate with the server 101 through Bluetooth, WiFi, or the like.
Furthermore, the implementation environment of the present embodiment may further include a database 103, where the training audio data set is stored in the database 103. In one implementation, as shown in FIG. 1, server 101 may obtain a training audio data set from database 103, and then perform model training based on the obtained training audio data set, thereby obtaining an identification model. After the training of the identity recognition model is completed, the identity recognition model can be deployed in the terminal device 102, the terminal device 102 can recognize the identity information of the user according to the identity recognition model to obtain an identity information recognition result, and then the wake-up operation of the terminal device 102 or an application program deployed in the terminal device 102 can be realized according to the identity information recognition result.
In another implementation manner, the terminal device 102 may also directly obtain a training audio data set from the database 103, and then perform model training according to the obtained training audio data set, so as to obtain an identification model. After the training of the identity recognition model is completed, the terminal device 102 can recognize the identity information of the user according to the identity recognition model to obtain an identity information recognition result, and then the wake-up operation of the terminal device 102 or the application program deployed in the terminal device 102 can be realized according to the identity information recognition result.
It should be noted that, the terminal device 102 may be, but is not limited to, an intelligent interaction device such as a smart phone, a tablet, a personal computer, an intelligent home appliance (for example, a water heater, a washing machine, a television, an intelligent sound box, etc.), an intelligent wearable device, etc.
In addition, the server 101 may be a server deployed independently, or may be a cluster server.
The method provided by the application can be widely applied to different application scenes related to the voiceprint wake-up function, and the identity recognition model training method and the implementation process of the identity recognition method provided by the application are described in detail below in combination with specific application scenes.
The technical scheme of the present application is described in detail below with specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 2 is a flow chart of an identity recognition model training method provided in an embodiment of the present application. The method of this embodiment may be executed by the server 101 or the terminal device 102. As shown in fig. 2, the method of this embodiment may include:
s201: and acquiring a training audio data set, and extracting features of the training audio data set to obtain a training feature set.
In this embodiment, the training audio data set may include a plurality of training audio data, where the training audio data may all be generated by the same user, may be generated by different users, or may be generated by the same user for only a part of the training audio data.
In addition, when acquiring the training audio data set, a pre-stored training audio data set can be obtained directly from the database, or the training audio data set can be obtained from a third-party training audio generation system; of course, other ways of acquiring the training audio data set also fall within the protection scope of this application and are not specifically limited here.
A user can be represented by the user's voiceprint information, and the meaning the user expresses can be represented by the content information extracted from the audio data. The training feature set obtained after feature extraction of the training audio data set can therefore include both the voiceprint features and the content features of the user, and these features can be represented as vectors, so that a voiceprint vector corresponding to the voiceprint features and a content vector corresponding to the content features can be obtained.
Further, the training feature set may be obtained by extracting MFCC (Mel-Frequency Cepstral Coefficient) features from the training audio data set.
Specifically, the mel frequency is defined based on the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz. MFCC extraction uses this relationship to compute spectral features, which are mainly used to extract voice data features; for example, 80-dimensional voiceprint features and content features can be extracted through MFCC. Using MFCC feature extraction, which is common knowledge in the industry for audio processing, improves both the convenience and the accuracy of feature extraction.
In addition, the training audio data set may also be feature-extracted using existing toolkits such as Kaldi, ESPnet, or librosa.
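As a concrete illustration of this step, the following is a minimal sketch of the feature extraction, assuming the librosa toolkit is used; the file names, sampling rate and 80-dimensional setting are illustrative assumptions, not values fixed by this application.

```python
# A minimal sketch of the feature-extraction step, assuming librosa;
# the 80-dimensional setting mirrors the example above, and the file
# list is hypothetical.
import librosa

def extract_features(wav_paths, n_dims=80, sr=16000):
    """Extract an MFCC-style feature matrix per training audio file."""
    feats = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=sr)
        # 80-dimensional MFCCs computed over the mel spectrum
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_dims)
        feats.append(mfcc.T)  # shape: (frames, n_dims)
    return feats

training_feature_set = extract_features(["audio_0001.wav", "audio_0002.wav"])
```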
In addition, the training audio data set contains target training audio, the target training audio contains a preset wake-up word, and the proportion of the target training audio in the training audio data set is greater than or equal to a preset proportion threshold.
Specifically, in order to make the content features of the training audio data set more pronounced, the amount of target training audio containing the preset wake-up word may be increased, so that the content features (i.e., phoneme features) corresponding to the preset wake-up word become more pronounced. Correspondingly, the proportion of the target training audio in the training audio data set can be set to be greater than or equal to a preset proportion threshold. The proportion threshold may be set according to the actual application scenario and is not discussed in detail here.
For example, the wake-up word may be "Xiaoma Xiaoma", and the training audio data set then needs to include target training audio whose speech contains "Xiaoma Xiaoma", where the proportion of the target training audio in the training audio data set is greater than or equal to the proportion threshold.
By increasing the amount of target training audio containing the preset wake-up word, the distribution of content features corresponding to the wake-up word becomes more concentrated, which shortens the training duration of the identity recognition model and improves its training efficiency.
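As an illustration of the proportion constraint above, the following sketch checks whether the target training audio meets a preset proportion threshold; the 0.5 threshold and the boolean labels are assumptions for illustration only.

```python
# Sketch of checking the wake-word proportion constraint; the threshold
# value and the labels are illustrative assumptions.
def check_target_proportion(contains_wake_word, threshold=0.5):
    """contains_wake_word: list of booleans, one per training audio."""
    proportion = sum(contains_wake_word) / len(contains_wake_word)
    return proportion >= threshold

# Three of four clips contain the wake word: 0.75 >= 0.5
assert check_target_proportion([True, True, False, True])
```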
S202: inputting the training feature set into a content recognition model included in the model to be trained for iterative training, and inputting the training feature set into the content recognition model after training is completed to output a content vector; and inputting the training feature set into a voiceprint recognition model included in the model to be trained for iterative training, and inputting the training feature set into the voiceprint recognition model after training is completed to output a voiceprint vector.
In this embodiment, the obtained training feature set contains the voiceprint features and content features of several users. In the prior art, after the training feature set is obtained, the content features in the training feature set may simply be ignored: only the voiceprint features of the user are extracted, a voiceprint recognition algorithm is trained on those voiceprint features, and the user identity is recognized with that algorithm. However, a single feature cannot fully characterize the voice information. In the voice wake-up scenario of a terminal device, a specific user is required to speak a specific keyword or a specific sentence to wake up the device. For example, user B needs to speak the keyword "Xiaoma Xiaoma" to wake up the terminal device; however, the existing voiceprint recognition algorithm can only recognize user B and cannot recognize the keyword or the specific sentence, so if user B utters some other keyword or sentence, the terminal device may still be woken up. This produces false wake-ups and degrades the user experience.
In order to reduce false wake-up conditions, the model to be trained can be trained, wherein the model to be trained can comprise a content recognition model, a voiceprint recognition model and a classifier, and the identity recognition model can be obtained by training the content recognition model, the voiceprint recognition model and the classifier of the model to be trained.
Further, a voiceprint recognition model for extracting voiceprint features and a content recognition model for extracting content features may be trained by training the feature set. The content recognition model after training outputs a content vector, and the voiceprint recognition model after training outputs a voiceprint vector, namely, the voice information can be expressed through the combination of two dimensions of the content feature and the voiceprint feature.
The training processes of the content recognition model and the voiceprint recognition model, and their feature extraction processes, are not limited to a particular order. The content recognition model may be trained first and then the voiceprint recognition model; the voiceprint recognition model may be trained first and then the content recognition model; or the two models may be trained simultaneously. The feature extraction processes are similar and are not discussed in detail here.
The content vector and the voiceprint vector may be one-dimensional vectors, and the extracted content vector may be a phoneme vector; that is, the content vector may be the phoneme information corresponding to each training audio (i.e., each training keyword or training sentence) in the training feature set. Specifically, a training keyword may be a single noun (e.g., "Xiaoma Xiaoma"), and a training sentence may be a sentence in a format other than a single name (e.g., "turn on the television"). For example, the content vector may be a = (phoneme information for "Xiaoma Xiaoma", phoneme information for "calf", phoneme information for "chicken", phoneme information for "Xiaoma Xiaoma", phoneme information for "kitten", phoneme information for "Xiaoma Xiaoma"), and the voiceprint vector may be b = (voiceprint of user 1, voiceprint of user 2, voiceprint of user 3, voiceprint of user 2, voiceprint of user 1, voiceprint of user 3).
S203: and inputting the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximum and the parameters are converged, so as to obtain the identity recognition model.
In this embodiment, after the training of the content recognition model and the voiceprint recognition model included in the model to be trained is completed, the content vector output by the content recognition model and the voiceprint vector output by the voiceprint recognition model can be obtained, and the classifier included in the model to be trained can then be iteratively trained on the content vector and the voiceprint vector until the likelihood of the classifier is maximum and its parameters converge, yielding the identity recognition model. The classifier may be a PLDA (Probabilistic Linear Discriminant Analysis) model. In the iterative training of the classifier, training can proceed by maximum likelihood estimation until the likelihood of the classifier is maximum and the parameters converge, at which point the identity recognition model is obtained. Correspondingly, the likelihood distribution value computed with these parameters is the maximum estimated by the classifier.
Fig. 3 is a schematic architecture diagram of an identity recognition model provided in an embodiment of the present application. As shown in fig. 3, in this embodiment the identity recognition model may include a content recognition model for extracting a content vector, a voiceprint recognition model for extracting a voiceprint vector, and a classifier for obtaining a likelihood distribution value from the content vector and the voiceprint vector, where the output ends of the content recognition model and the voiceprint recognition model are connected to the input end of the classifier.
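The architecture of fig. 3 can be sketched structurally as follows; the class and method names are hypothetical and only illustrate how the two models feed the classifier.

```python
# A structural sketch of the identity recognition model in Fig. 3: the
# content and voiceprint models feed the classifier. Names are hypothetical.
class IdentityRecognitionModel:
    def __init__(self, content_model, voiceprint_model, classifier):
        self.content_model = content_model        # e.g. a Conformer phoneme model
        self.voiceprint_model = voiceprint_model  # e.g. ResNet + TDNN
        self.classifier = classifier              # e.g. a PLDA model

    def score(self, features, enroll_content_vec, enroll_voiceprint_vec):
        content_vec = self.content_model(features)
        voiceprint_vec = self.voiceprint_model(features)
        # Both outputs are routed to the classifier's input (Fig. 3)
        return self.classifier.likelihood(
            content_vec, voiceprint_vec,
            enroll_content_vec, enroll_voiceprint_vec)
```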
It can be seen that in the embodiment of the application, the trained identity recognition model comprises a content recognition model and a voiceprint recognition model, so that two dimensions of information, content and voiceprint, are considered simultaneously, which improves the accuracy of subsequent identity recognition. In addition, the identity recognition model further comprises a classifier; the data used to train the classifier are extracted by the trained content recognition model and voiceprint recognition model, and since the data extracted by trained models have high accuracy, the accuracy of the classifier trained on these data is improved, which further improves the accuracy of the identity recognition model and, finally, of subsequent identity recognition.
The examples of the present specification also provide some specific embodiments of the method based on the method of fig. 2, which is described below.
In another embodiment, the content recognition model may be a Conformer network (Convolution-augmented Transformer for Speech Recognition), i.e., a convolution-enhanced phoneme recognition model, and the content vector is a phoneme vector. The phoneme recognition model can extract the phoneme features in the voice information and then take the last-layer output vector of the extracted phoneme features as the phoneme vector.
In addition, because phoneme features reflect the content information of speech more faithfully, compared with a voiceprint recognition algorithm trained on keyword features, an identity recognition model trained on phoneme features can determine the meaning of the voice information more accurately, improving the accuracy of content recognition and, in turn, of identity recognition.
Further, if the content recognition model is a phoneme recognition model based on convolution enhancement, the specific implementation process of inputting the training feature set into the content recognition model included in the model to be trained to perform iterative training may include:
and inputting the training feature set into the phoneme recognition model for iterative training, and determining a first loss value of the phoneme recognition model according to a preset first gradient descent algorithm.
And under the condition that the first loss value is greater than or equal to the first loss value threshold value, training of the phoneme recognition model is completed.
In particular, the first gradient descent algorithm may be an existing algorithm and is not discussed in detail here. The first loss value threshold may be determined according to factors such as the degree of convergence of the phoneme recognition model or its test accuracy. For example, the first loss value threshold may be set to correspond to a test accuracy of 97% or 98%.
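A minimal sketch of this iterative training loop follows, reading the stopping criterion as held-out accuracy reaching the threshold; train_one_epoch and evaluate are hypothetical helpers standing in for the gradient descent algorithm and the test step.

```python
# Sketch of the iterative training loop of S202 with the stopping criterion
# read as "stop once held-out accuracy reaches the threshold"; the helper
# functions are hypothetical stand-ins.
def train_until_threshold(model, data, accuracy_threshold=0.97, max_epochs=100):
    for epoch in range(max_epochs):
        loss = train_one_epoch(model, data.train)      # gradient-descent update
        accuracy = evaluate(model, data.held_out)      # proxy for the loss criterion
        if accuracy >= accuracy_threshold:
            break  # training of the phoneme recognition model is completed
    return model
```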
In addition, the voiceprint recognition model can be a ResNet TDNN (Residual Neural Network and Time Delay Neural Network), a deep convolutional neural network. The introduction of a dynamic routing layer allows the number of network layers to be deepened while remaining trainable, greatly reduces the network parameters, and improves network performance and efficiency. By attaching the TDNN layer after the ResNet, the network can better capture audio timing information, improving the accuracy of user voiceprint recognition.
Correspondingly, if the voiceprint recognition model is a residual network and a time delay neural network, the specific implementation process of inputting the training feature set into the voiceprint recognition model included in the model to be trained to perform iterative training may include:
and inputting the training feature set into a residual error network and a time delay neural network for iterative training, and determining a second loss value of the residual error network and the time delay neural network according to a preset second gradient descent algorithm.
And under the condition that the second loss value is larger than or equal to the second loss value threshold value, training the residual error network and the time delay neural network is completed.
Specifically, the second gradient descent algorithm may be an existing algorithm and is not discussed in detail here. The first gradient descent algorithm and the second gradient descent algorithm may be the same algorithm or different algorithms, provided each realizes its corresponding function.
Similarly, the second loss value threshold may be determined according to the degree of convergence of the residual network and time delay neural network, or according to the test accuracy. For example, the second loss value threshold may be set to correspond to a test accuracy of 97% or 98%.
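A compressed sketch of a ResNet + TDNN voiceprint embedder follows, assuming PyTorch; the layer sizes and statistics pooling are illustrative choices, not taken from this application.

```python
# A compressed ResNet + TDNN voiceprint embedder sketch (PyTorch assumed);
# all sizes are illustrative.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv1d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # residual connection: output = x + F(x)
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class ResNetTDNN(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        self.front = nn.Conv1d(feat_dim, 256, 5, padding=2)
        self.res = nn.Sequential(ResBlock(256), ResBlock(256))
        # TDNN layer attached after the ResNet to capture audio timing
        self.tdnn = nn.Conv1d(256, 256, 3, dilation=2, padding=2)
        self.emb = nn.Linear(512, emb_dim)

    def forward(self, x):  # x: (batch, feat_dim, frames)
        h = self.tdnn(self.res(self.front(x)))
        # statistics pooling over time: mean and standard deviation
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.emb(stats)  # the voiceprint vector
```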
Fig. 4 is a flow chart of an identification model training method according to another embodiment of the present application, as shown in fig. 4, in this embodiment, S203 may specifically include:
s401: the content vector and the voiceprint vector are connected in parallel to form a one-dimensional voice training vector.
In this embodiment, the content vector may be a vector representing the voice content, and the voiceprint vector may be a vector representing the user. Before the classifier is iteratively trained on the content vector and the voiceprint vector, the two vectors may be connected in parallel (concatenated) to form a one-dimensional speech training vector, on which the classifier is then trained.
Illustratively, given the content vector a and the voiceprint vector b of the earlier example, the speech training vector may be c = (phoneme information for "Xiaoma Xiaoma", phoneme information for "calf", phoneme information for "chicken", phoneme information for "Xiaoma Xiaoma", phoneme information for "kitten", phoneme information for "Xiaoma Xiaoma", voiceprint of user 1, voiceprint of user 2, voiceprint of user 3, voiceprint of user 2, voiceprint of user 1, voiceprint of user 3), i.e., the entries of a followed by the entries of b.
The maximum length of each vector may be set according to the actual application, which will not be discussed in detail herein.
By connecting the content vector and the voiceprint vector in parallel into a one-dimensional speech training vector for training, the vectors input in the same batch contain both content and voiceprint information, which increases the variety of features in the speech training vector, makes its training dimensions richer, and lets the classifier learn the features corresponding to the training feature set more accurately.
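S401 reduces to a simple concatenation; a sketch with illustrative vector sizes:

```python
# S401 as code: connecting the content vector and voiceprint vector in
# parallel into one one-dimensional speech training vector. Sizes are
# illustrative assumptions.
import numpy as np

content_vec = np.random.randn(192)      # phoneme vector (illustrative size)
voiceprint_vec = np.random.randn(256)   # voiceprint vector (illustrative size)
speech_training_vec = np.concatenate([content_vec, voiceprint_vec])
assert speech_training_vec.ndim == 1 and speech_training_vec.size == 448
```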
S402: the speech training vector is input into a classifier for iterative training.
In this embodiment, the classifier may be a PLDA model, and after obtaining the speech training vector, the PLDA model may be iteratively trained by the speech training vector.
Further, the specific implementation process of S402 may include:
and respectively carrying out mean value processing and initialization processing on the classifier according to the voice training vector to obtain an initial maximum likelihood estimation expression.
And inputting the voice training vector and the initial maximum likelihood estimation expression into a classifier for iterative training.
Specifically, the voice training vector includes different voice contents corresponding to different users, and the classifier can be iteratively trained according to the different voice contents corresponding to different users. Correspondingly, average value processing and initialization processing can be respectively carried out on the PLDA model to obtain an initial maximum likelihood estimation expression.
Further, the parameters included in the initial maximum likelihood estimation expression may be the mean of the overall speech training vectors, an identity-space feature matrix, a noise-space feature matrix, and a noise covariance. The identity-space feature matrix represents information about different users; the noise-space feature matrix represents information about different speech variations of the same user; and the noise covariance represents the residual variability that is not otherwise explained. The speech training vector and the initial maximum likelihood estimation expression are then input into the classifier, and iterative training with a preset algorithm determines the value of each parameter in the initial maximum likelihood estimation expression. By way of example, the preset algorithm may be the EM (Expectation Maximization) algorithm.
In EM training, the parameter values are first estimated from the given observed data by maximum likelihood estimation; the missing (latent) data are then estimated from the parameter values obtained in the previous step; the parameters are re-estimated from the estimated latent data together with the previously acquired data to obtain new parameter values; and this is iterated repeatedly until the likelihood of the classifier is maximum and the parameters converge.
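The EM iteration described above can be sketched for a simplified PLDA-style model x = mu + V y + eps, with y ~ N(0, I) and eps ~ N(0, Sigma). This is a bare-bones illustration under those assumptions; production PLDA implementations (e.g. in Kaldi) handle numerics, priors and scoring far more carefully.

```python
# Heavily simplified EM sketch for a PLDA-style model x = mu + V y + eps.
import numpy as np

def plda_em(X_by_class, latent_dim, n_iter=20):
    """X_by_class: one (n_i, D) array of speech training vectors per user."""
    X_all = np.vstack(X_by_class)
    mu = X_all.mean(axis=0)                          # mean of all training vectors
    D = X_all.shape[1]
    rng = np.random.default_rng(0)
    V = 0.1 * rng.standard_normal((D, latent_dim))   # identity-space feature matrix
    Sigma = np.cov(X_all.T) + 1e-3 * np.eye(D)       # noise covariance (initial guess)
    for _ in range(n_iter):
        Si = np.linalg.inv(Sigma)
        acc_xy = np.zeros((D, latent_dim))
        acc_yy = np.zeros((latent_dim, latent_dim))
        acc_xx = np.zeros((D, D))
        n_total = 0
        for X in X_by_class:
            Xc = X - mu
            n, s = len(X), Xc.sum(axis=0)
            # E-step: posterior of the latent identity variable y for this user
            L = np.eye(latent_dim) + n * V.T @ Si @ V
            Ey = np.linalg.solve(L, V.T @ Si @ s)
            Eyy = np.linalg.inv(L) + np.outer(Ey, Ey)
            acc_xy += np.outer(s, Ey)
            acc_yy += n * Eyy
            acc_xx += Xc.T @ Xc
            n_total += n
        # M-step: re-estimate identity-space matrix and noise covariance
        V = acc_xy @ np.linalg.inv(acc_yy)
        Sigma = (acc_xx - V @ acc_xy.T) / n_total
        Sigma = 0.5 * (Sigma + Sigma.T) + 1e-6 * np.eye(D)  # keep symmetric
    return mu, V, Sigma
```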
With this scheme, because different voice features produce different likelihood distributions under the PLDA model, user identity recognition can be performed according to those distributions. And because the PLDA model has strong channel compensation capability, it can represent the features of the voice information to the greatest extent, improving the accuracy with which the fitted parameters express the features of the voice information, and thus the accuracy of identity recognition.
Furthermore, in another embodiment, the identity recognition model may further include a feature extraction module, where the input ends of the content recognition model and the voiceprint recognition model are connected with the output end of the feature extraction module. Correspondingly, performing feature extraction on the training audio data set to obtain the training feature set may specifically include:
and carrying out feature extraction on the training audio data set through a feature extraction module to obtain a training feature set.
In this embodiment, the feature extraction module may perform feature extraction on the training audio data set using existing toolkits such as Kaldi, ESPnet, or librosa, improving the efficiency and accuracy of feature extraction.
Fig. 5 is a schematic diagram of an identity recognition model training process provided in an embodiment of the present application. As shown in fig. 5, in this embodiment the content recognition model is a phoneme recognition model. A training audio data set may first be acquired and input to the feature extraction module for feature extraction, yielding a training feature set. The training feature set is then input to the phoneme recognition model and the voiceprint recognition model included in the model to be trained for training; after training is completed, a voiceprint vector is obtained through the trained voiceprint recognition model, and a phoneme vector through the trained phoneme recognition model. The phoneme vector and the voiceprint vector can then be connected in parallel to form a one-dimensional speech training vector, and the PLDA model is iteratively trained on the speech training vector until its likelihood is maximum and its parameters converge, yielding the identity recognition model.
The identity recognition model obtained by the method for training the identity recognition model in the foregoing embodiments may be applied to the field of voice wake-up, and after the identity recognition model is obtained, the identity recognition model may be deployed in a terminal device, so as to wake up the terminal device or an application deployed in the terminal device. The technical solutions of the present application are described in detail below with specific embodiments, and the following specific embodiments may be combined with each other, and may not be repeated in some embodiments for the same or similar concepts or processes.
Fig. 6 is a flow chart of an identity recognition method provided in an embodiment of the present application. This embodiment applies the identity recognition model obtained by the identity recognition model training method described in the foregoing embodiments, and the method may be performed by the terminal device 102. As shown in fig. 6, the method of this embodiment may include:
s601: and acquiring first voice data of the user to be identified.
In this embodiment, in some application scenarios the user wants to wake up the terminal device, or an application deployed on it, remotely, or cannot wake it up by touch at that moment. In one application scenario, a user wants to turn on the water heater for heating but does not want to do so manually, so the water heater can be turned on remotely by voice wake-up. In another application scenario, the user cannot find the smartphone and has no other smart terminal device at hand to communicate with it, so the smartphone can be woken up by voice in order to locate it. In yet another application scenario, the user wants to open a payment application to pay while shopping, but is holding shopping bags with both hands, making it inconvenient to open the payment application by touching the terminal device; the payment application can therefore be opened by voice wake-up to make the payment.
In the application scenario, the terminal device may first acquire the first voice data, then identify the identity information according to the first voice data, and then determine whether to wake up the terminal device or an application deployed in the terminal device according to the identity information identification result.
Further, the first voice data may include the content information and the voiceprint information of the user. Illustratively, the first voice data may be: user A utters the wake-up keyword "Xiaoma Xiaoma", user B utters the wake-up keyword "Xiaomao Xiaomao", or user C utters the wake-up keyword "Xiaoma Xiaoma".
S602: inputting the first voice data into a content recognition model included in the identity recognition model, and outputting a target content vector; and inputting the first voice data into a voiceprint recognition model included in the identity recognition model, and outputting a target voiceprint vector.
In this embodiment, after the first voice data is obtained, the first voice data may be input into a content recognition model included in the identity recognition model to perform recognition to obtain a target content vector, and may also be input into a voiceprint recognition model included in the identity recognition model to perform recognition to obtain a target voiceprint vector. The content recognition model may be a phoneme recognition model, and the obtained target content vector is a target phoneme vector.
In addition, the first voice data may be input into the content recognition model for recognition first, into the voiceprint recognition model first, or into both models simultaneously; the order is not specifically limited here.
S603: inputting the target content vector, the target voiceprint vector, the preset content vector and the preset voiceprint vector into a classifier included in the identity recognition model, and outputting a likelihood distribution value. The preset content vector and the preset voiceprint vector are obtained by inputting second voice data of the target user into the content recognition model and the voiceprint recognition model, respectively.
In this embodiment, before the first voice data is acquired, the second voice data of the target user may be acquired first; the second voice data is then input to the content recognition model to obtain the preset content vector, and to the voiceprint recognition model to obtain the preset voiceprint vector. Correspondingly, the target user can be a pre-registered user entitled to wake up the terminal device or related application programs; that is, the preset content vector and preset voiceprint vector obtained from the second voice data of the target user serve as the basis for judging the user's identity information.
After registration is completed, whether first voice data is generated by a user can be detected in real time. Once the first voice data is obtained, the target content vector and target voiceprint vector corresponding to it can be determined, and the target content vector, the target voiceprint vector, the preset content vector and the preset voiceprint vector can then be input into the classifier included in the identity recognition model, which outputs a likelihood distribution value. The likelihood distribution value can represent the similarity between the first voice data and the second voice data.
S604: and under the condition that the likelihood distribution value is larger than a preset likelihood distribution value threshold value, determining that the user to be identified and the target user are the same user.
In this embodiment, if the likelihood distribution value is greater than the preset likelihood distribution value threshold, the similarity between the first voice data and the second voice data is high; that is, the user to be identified and the target user can be determined to be the same user. If the likelihood distribution value is less than or equal to the preset threshold, the similarity between the first voice data and the second voice data is low, and the user to be identified and the target user can be determined not to be the same user.
The likelihood distribution value threshold may be set according to the actual application scene; for example, it may be any value from 80 to 90.
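The decision rule of S604 then reduces to a threshold comparison; in this sketch the threshold of 85 is just one value from the 80-90 range mentioned above.

```python
# Sketch of the decision rule in S604; the threshold is an assumed value.
def is_same_user(likelihood_value, threshold=85.0):
    """True if the user to be identified matches the target user."""
    return likelihood_value > threshold
```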
It can be seen that in the embodiment of the application, content information and voiceprint information are considered together during identity recognition, so that more factors are taken into account, recognition no longer relies on a single type of information, and the accuracy of identity recognition is improved.
In another embodiment, the method may further include:
and carrying out wake-up word recognition on the first voice data to obtain a wake-up word recognition result.
If the wake-up word recognition result is that recognition is correct, the steps of inputting the first voice data into the content recognition model included in the identity recognition model and outputting the target content vector, and inputting the first voice data into the voiceprint recognition model included in the identity recognition model and outputting the target voiceprint vector, are executed.
In this embodiment, a third-party wake-up word recognition system may be deployed in the terminal device in advance. Before the first voice data is recognized by the identity recognition model, it may first be recognized by the third-party wake-up word recognition system to obtain a wake-up word recognition result. If the wake-up word recognition result is that recognition is correct, indicating that the first voice data contains the predefined keyword, recognition continues through the identity recognition model to obtain the identity recognition result. If the wake-up word recognition result is a recognition error, indicating that the first voice data does not contain the predefined keyword, a wake-up failure prompt can be generated directly to remind the user that voice wake-up failed, without recognition by the identity recognition model.
Furthermore, after it is determined that the user to be identified and the target user are the same user, the method may further include: generating a wake-up instruction, where the wake-up instruction is used for waking up the terminal device or an application program deployed in the terminal device.
With this scheme, a double verification mode is adopted: even when the third-party wake-up word recognition system misrecognizes the wake-up word, the added identity recognition model with phoneme modeling can still effectively judge whether the voice data match the preset voice data. This effectively reduces voice misrecognition and, in turn, reduces false wake-ups of the terminal device or related application programs in practical applications.
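The double verification flow can be summarized with the minimal sketch below. The kws_system and identity_model objects and their methods are hypothetical placeholders for the third-party wake-up word recognition system and the trained identity recognition model; only the two-stage control flow comes from the description.

```python
def handle_voice_wakeup(first_voice_data, kws_system, identity_model, threshold):
    # Stage 1: third-party wake-up word recognition (assumed interface)
    if not kws_system.contains_wake_word(first_voice_data):
        return "wake-up failure prompt"       # no predefined keyword; stop early
    # Stage 2: identity recognition with phoneme modeling (assumed interface)
    content_vec = identity_model.content_model(first_voice_data)
    voiceprint_vec = identity_model.voiceprint_model(first_voice_data)
    score = identity_model.classify(content_vec, voiceprint_vec)
    if score > threshold:
        return "wake-up instruction"          # wake the device or application
    return "wake-up failure prompt"
```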
Fig. 7 is a scene diagram of an identification method according to an embodiment of the present application; this embodiment describes in detail the training process of the identity recognition model and the complete implementation flow of the identification method with reference to that scene. As shown in fig. 7, the identity recognition model may be trained as follows: first acquire a training audio data set and input it into the feature extraction module to obtain a training feature set; then input the training feature set into the phoneme recognition model and the voiceprint recognition model respectively for training. After training is completed, a voiceprint vector is obtained through the trained voiceprint recognition model and a phoneme vector through the trained phoneme recognition model. The voiceprint vector and the phoneme vector are then connected in parallel to form a one-dimensional voice training vector, and the classifier included in the model to be trained is iteratively trained on the voice training vectors until the likelihood of the classifier is maximized and its parameters converge, yielding the identity recognition model.

After training is completed, the identity recognition model can be deployed in the terminal device, for example a smart phone. Once deployed, first voice data generated by a user to be identified and pre-stored second voice data can be acquired and recognized based on the identity recognition model to obtain a likelihood distribution value. If the likelihood distribution value is greater than the likelihood distribution value threshold, the user to be identified and the target user are determined to be the same user, and a wake-up instruction can be generated to wake the terminal device or a related application program. Otherwise, the user to be identified and the target user are determined to be different users, and a wake-up failure prompt is generated to remind the user that the terminal device or the related application program failed to wake up.
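The training flow above can be condensed into the following Python sketch. Every helper named here (extract_features, train_phoneme_model, train_voiceprint_model, fit_classifier, and the embed methods) is a hypothetical stand-in for the corresponding module in fig. 7, not an API disclosed by this application.

```python
import numpy as np

def train_identity_model(training_audio_set):
    # Feature extraction module -> training feature set (e.g. MFCC features)
    features = extract_features(training_audio_set)
    # Train the content (phoneme) branch and the voiceprint branch separately
    phoneme_model = train_phoneme_model(features)
    voiceprint_model = train_voiceprint_model(features)
    # Obtain phoneme and voiceprint vectors from the trained branches
    phoneme_vecs = phoneme_model.embed(features)
    voiceprint_vecs = voiceprint_model.embed(features)
    # Connect each pair in parallel into a one-dimensional voice training vector
    voice_training_vecs = [np.concatenate([p, v])
                           for p, v in zip(phoneme_vecs, voiceprint_vecs)]
    # Iterate the classifier until its likelihood is maximized and
    # its parameters converge
    classifier = fit_classifier(voice_training_vecs)
    return phoneme_model, voiceprint_model, classifier
```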
Based on the same idea, an embodiment of the present disclosure further provides a device corresponding to the method. Fig. 8 is a schematic structural diagram of an identity recognition model training device provided in an embodiment of the present disclosure. As shown in fig. 8, the device provided in this embodiment may include:
the first obtaining module 801 is configured to obtain a training audio data set, and perform feature extraction on the training audio data set to obtain a training feature set.
In this embodiment, the training feature set is extracted from the training audio data set using mel-frequency cepstral coefficients (MFCCs).
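As a concrete illustration of this step, per-clip MFCC extraction might look like the sketch below. The 16 kHz sample rate and 13 coefficients are common defaults assumed here; the application does not fix any MFCC settings.

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    # Load at 16 kHz, a typical rate for speech tasks (an assumption)
    waveform, sr = librosa.load(wav_path, sr=16000)
    # Frame-level mel-frequency cepstral coefficients
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (frames, n_mfcc)
```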
Further, the training audio data set includes target training audio, the target training audio contains a preset wake-up word, and the proportion of the target training audio in the training audio data set is greater than or equal to a preset proportion threshold.
A first processing module 802, configured to input the training feature set into a content recognition model included in the model to be trained for iterative training and, after training is completed, input the training feature set into the trained content recognition model to output a content vector; and to input the training feature set into a voiceprint recognition model included in the model to be trained for iterative training and, after training is completed, input the training feature set into the trained voiceprint recognition model to output a voiceprint vector.
The first processing module 802 is further configured to input the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximized and its parameters converge, obtaining the identity recognition model.
In another embodiment, the content recognition model is a convolution-enhanced phoneme recognition model and the content vector is a phoneme vector. Correspondingly, the first processing module 802 is further configured to:
input the training feature set into the phoneme recognition model for iterative training, and determine a first loss value of the phoneme recognition model according to a preset first gradient descent algorithm; and
complete training of the phoneme recognition model under the condition that the first loss value is greater than or equal to a first loss value threshold.
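A hedged sketch of such an iterative training loop follows. The SGD optimizer, learning rate, and the 0.1 threshold are assumptions; the stopping test is also an assumption, using the conventional reading that training ends once the loss has fallen to the threshold.

```python
import torch

def train_phoneme_model(model, loader, loss_fn,
                        loss_threshold=0.1, max_epochs=50):
    # "Preset first gradient descent algorithm": plain SGD is assumed here
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for features, phoneme_labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), phoneme_labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Assumed convergence test against the first loss value threshold
        if epoch_loss / len(loader) <= loss_threshold:
            break
    return model
```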
In addition, the voiceprint recognition model may be a residual network combined with a time-delay neural network, in which case the first processing module 802 is further configured to:
input the training feature set into the residual network and the time-delay neural network for iterative training, and determine a second loss value of the residual network and the time-delay neural network according to a preset second gradient descent algorithm; and
complete training of the residual network and the time-delay neural network under the condition that the second loss value is greater than or equal to a second loss value threshold.
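A minimal residual-plus-TDNN voiceprint branch could be sketched as below; all layer sizes, the pooling choice, and the 192-dimensional embedding are assumptions, since the application discloses no architectural details.

```python
import torch
import torch.nn as nn

class VoiceprintNet(nn.Module):
    """Hypothetical residual + time-delay (TDNN) voiceprint embedding."""
    def __init__(self, feat_dim=13, embed_dim=192):
        super().__init__()
        # TDNN layers are 1-D convolutions over time; padding keeps the length
        self.tdnn1 = nn.Conv1d(feat_dim, 512, kernel_size=5, padding=2)
        self.tdnn2 = nn.Conv1d(512, 512, kernel_size=3, dilation=2, padding=2)
        self.fc = nn.Linear(512, embed_dim)

    def forward(self, x):                  # x: (batch, feat_dim, frames)
        h = torch.relu(self.tdnn1(x))
        h = torch.relu(h + self.tdnn2(h))  # residual connection over the TDNN
        h = h.mean(dim=2)                  # mean pooling over time
        return self.fc(h)                  # fixed-length voiceprint vector
```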
In another embodiment, the first processing module 802 is further configured to:
connecting the content vector and the voiceprint vector in parallel to form a one-dimensional voice training vector; and
inputting the voice training vector into the classifier for iterative training.
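The parallel connection is plain vector concatenation, as in the snippet below; the branch dimensions 128 and 192 are arbitrary placeholders.

```python
import numpy as np

content_vector = np.random.rand(128)      # from the content (phoneme) model
voiceprint_vector = np.random.rand(192)   # from the voiceprint model
voice_training_vector = np.concatenate([content_vector, voiceprint_vector])
assert voice_training_vector.ndim == 1    # one-dimensional, shape (320,)
```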
In this embodiment, the first processing module 802 is further configured to:
performing mean value processing and initialization processing on the classifier according to the voice training vector to obtain an initial maximum likelihood estimation expression; and
inputting the voice training vector and the initial maximum likelihood estimation expression into the classifier for iterative training.
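The mean value initialization and maximum likelihood training admit many realizations; one standard, hedged reading is a Gaussian mixture fitted by expectation-maximization, sketched below with an illustrative component count, vector dimension, and stand-in data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
voice_vectors = rng.standard_normal((1000, 320))   # stand-in training vectors

# Mean value processing: initialize component means around the global mean
global_mean = voice_vectors.mean(axis=0)
means_init = global_mean + 0.01 * rng.standard_normal((8, 320))

# EM fitting iterates until the likelihood is maximized and parameters converge
gmm = GaussianMixture(n_components=8, means_init=means_init, max_iter=200)
gmm.fit(voice_vectors)
log_likelihood = gmm.score(voice_vectors[:1])      # per-sample log-likelihood
```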
Fig. 9 is a schematic structural diagram of an identity recognition device provided in an embodiment of the present application; the device uses the identity recognition model obtained with the identity recognition model training device of the foregoing embodiment. As shown in fig. 9, the device provided in this embodiment may include:
a second obtaining module 901, configured to obtain first voice data of a user to be identified.
A second processing module 902, configured to input the first voice data into a content recognition model included in the identity recognition model, and output a target content vector; and inputting the first voice data into a voiceprint recognition model included in the identity recognition model, and outputting a target voiceprint vector.
The second processing module 902 is further configured to input the target content vector, the target voiceprint vector, the preset content vector, and the preset voiceprint vector into a classifier included in the identity recognition model and output a likelihood distribution value; the preset voiceprint vector is obtained by inputting second voice data of the target user into the voiceprint recognition model.
The second processing module 902 is further configured to determine that the user to be identified and the target user are the same user if the likelihood distribution value is greater than a preset likelihood distribution value threshold.
In another embodiment, the second processing module 902 is further configured to:
perform wake-up word recognition on the first voice data to obtain a wake-up word recognition result; and,
if the wake-up word recognition result indicates correct recognition, input the first voice data into the content recognition model included in the identity recognition model and output the target content vector, and input the first voice data into the voiceprint recognition model included in the identity recognition model and output the target voiceprint vector.
And, after it is determined that the user to be identified and the target user are the same user, generate a wake-up instruction, where the wake-up instruction is used for waking up the terminal device or an application program deployed in the terminal device.
The device provided in the embodiment of the present application may implement the method of the embodiment shown in fig. 2, and its implementation principle and technical effects are similar, and are not described herein again.
Fig. 10 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application. As shown in fig. 10, the device 1000 provided in this embodiment includes: at least one processor 1001 and a memory 1002, connected by a bus 1003.
In a specific implementation, at least one processor 1001 executes computer-executable instructions stored in the memory 1002, so that the at least one processor 1001 performs the method in the above-described method embodiment.
The specific implementation process of the processor 1001 may refer to the above method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.
In the embodiment shown in fig. 10, it should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the present invention may be executed directly by a hardware processor, or by a combination of hardware and software modules in a processor.
The memory may include high-speed RAM, and may further include non-volatile memory (NVM), such as at least one magnetic disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, the buses in the drawings of the present application are not limited to only one bus or one type of bus.
An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the identity recognition model training method or the identity recognition method of the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the identity recognition model training method or the identity recognition method described above.
The computer-readable storage medium described above may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. A readable storage medium can be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Alternatively, the readable storage medium may be integral to the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC), or may reside as discrete components in a device.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be performed by hardware under the control of program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (15)

1. A method for training an identity recognition model, the method comprising:
acquiring a training audio data set, and extracting features of the training audio data set to obtain a training feature set;
inputting the training feature set into a content recognition model included in a model to be trained for iterative training, and inputting the training feature set into the content recognition model after training is completed to output a content vector; inputting the training feature set into a voiceprint recognition model included in the model to be trained for iterative training, and inputting the training feature set into the voiceprint recognition model after training is completed to output a voiceprint vector;
and inputting the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximum and parameters are converged, so as to obtain an identity recognition model.
2. The method of claim 1, wherein the inputting the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training comprises:
connecting the content vector and the voiceprint vector in parallel to form a one-dimensional voice training vector; and
inputting the voice training vector into the classifier for iterative training.
3. The method of claim 2, wherein said inputting the speech training vector into the classifier for iterative training comprises:
respectively carrying out mean value processing and initialization processing on the classifier according to the voice training vector to obtain an initial maximum likelihood estimation expression;
and inputting the voice training vector and the initial maximum likelihood estimation expression into the classifier to perform iterative training.
4. The method of claim 1, wherein the content recognition model is a convolution-enhanced phoneme recognition model and the content vector is a phoneme vector.
5. The method of claim 4, wherein the inputting the training feature set into a content recognition model included in the model to be trained for iterative training comprises:
inputting the training feature set into the phoneme recognition model for iterative training; determining a first loss value of the phoneme recognition model according to a preset first gradient descent algorithm;
and under the condition that the first loss value is greater than or equal to a first loss value threshold value, completing training of the phoneme recognition model.
6. The method of claim 1, wherein the voiceprint recognition model is a residual network and a time-delay neural network; the step of inputting the training feature set into a voiceprint recognition model included in the model to be trained for iterative training comprises the following steps:
inputting the training feature set into the residual network and the time-delay neural network for iterative training; determining a second loss value of the residual network and the time-delay neural network according to a preset second gradient descent algorithm;
and completing training of the residual network and the time-delay neural network under the condition that the second loss value is greater than or equal to a second loss value threshold.
7. The method according to any one of claims 1-6, wherein the training feature set is extracted from the training audio data set using mel-frequency cepstral coefficients (MFCCs).
8. The method of any of claims 1-6, wherein the training audio data set comprises target training audio containing a preset wake-up word, and the proportion of the target training audio in the training audio data set is greater than or equal to a preset proportion threshold.
9. A method of identity recognition, the method comprising:
acquiring first voice data of a user to be identified;
inputting the first voice data into a content recognition model included in an identity recognition model, and outputting a target content vector; inputting the first voice data into a voiceprint recognition model included in the identity recognition model, and outputting a target voiceprint vector;
inputting the target content vector, the target voiceprint vector, the preset content vector and the preset voiceprint vector into a classifier included in the identity recognition model, and outputting a likelihood distribution value; the preset voiceprint vector is obtained by inputting second voice data of the target user into the voiceprint recognition model;
and under the condition that the likelihood distribution value is larger than a preset likelihood distribution value threshold value, determining that the user to be identified and the target user are the same user.
10. The method according to claim 9, wherein the method further comprises:
performing wake-up word recognition on the first voice data to obtain a wake-up word recognition result;
If the wake-up word recognition result is that the recognition is correct, the first voice data is input into a content recognition model included in an identity recognition model, and a target content vector is output; inputting the first voice data into a voiceprint recognition model included in the identity recognition model, and outputting a target voiceprint vector;
after the user to be identified and the target user are determined to be the same user, the method further comprises: and generating a wake-up instruction, wherein the wake-up instruction is used for waking up the terminal equipment or an application program deployed in the terminal equipment.
11. An identification model training device, characterized in that the device comprises:
the first acquisition module is used for acquiring a training audio data set and extracting features of the training audio data set to obtain a training feature set;
the first processing module is used for inputting the training feature set into a content recognition model included in a model to be trained to carry out iterative training, and inputting the training feature set into the content recognition model after training is completed to output a content vector; inputting the training feature set into a voiceprint recognition model included in the model to be trained for iterative training, and inputting the training feature set into the voiceprint recognition model after training is completed to output a voiceprint vector;
The first processing module is further configured to input the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training until likelihood of the classifier is maximum, and parameters converge to obtain an identity recognition model.
12. An identification device, the device comprising:
the second acquisition module is used for acquiring first voice data of the user to be identified;
the second processing module is used for inputting the first voice data into a content recognition model included in the identity recognition model and outputting a target content vector; inputting the first voice data into a voiceprint recognition model included in the identity recognition model, and outputting a target voiceprint vector;
the second processing module is further configured to input the target content vector, the target voiceprint vector, the preset content vector and the preset voiceprint vector into a classifier included in the identity recognition model, and output a likelihood distribution value; the preset voiceprint vector is obtained by inputting second voice data of the target user into the voiceprint recognition model;
The second processing module is further configured to determine that the user to be identified and the target user are the same user when the likelihood distribution value is greater than a preset likelihood distribution value threshold.
13. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the identity recognition model training method of any one of claims 1 to 8, or the identity recognition method of any one of claims 9 to 10.
14. A computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the identity recognition model training method of any one of claims 1 to 8, or the identity recognition method of any one of claims 9 to 10.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the identity recognition model training method of any one of claims 1 to 8, or the identity recognition method of any one of claims 9 to 10.
CN202110681339.2A 2021-06-18 2021-06-18 Identity recognition model training method, identity recognition method and device Active CN113421573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110681339.2A CN113421573B (en) 2021-06-18 2021-06-18 Identity recognition model training method, identity recognition method and device

Publications (2)

Publication Number Publication Date
CN113421573A CN113421573A (en) 2021-09-21
CN113421573B true CN113421573B (en) 2024-03-19

Family

ID=77789450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110681339.2A Active CN113421573B (en) 2021-06-18 2021-06-18 Identity recognition model training method, identity recognition method and device

Country Status (1)

Country Link
CN (1) CN113421573B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113823294B (en) * 2021-11-23 2022-03-11 清华大学 Cross-channel voiceprint recognition method, device, equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274890A (en) * 2017-07-04 2017-10-20 清华大学 Vocal print composes extracting method and device
WO2017197953A1 (en) * 2016-05-16 2017-11-23 腾讯科技(深圳)有限公司 Voiceprint-based identity recognition method and device
CN108154371A (en) * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 Electronic device, the method for authentication and storage medium
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN108958810A (en) * 2018-02-09 2018-12-07 北京猎户星空科技有限公司 A kind of user identification method based on vocal print, device and equipment
CN110060693A (en) * 2019-04-16 2019-07-26 Oppo广东移动通信有限公司 Model training method, device, electronic equipment and storage medium
CN110491393A (en) * 2019-08-30 2019-11-22 科大讯飞股份有限公司 The training method and relevant apparatus of vocal print characterization model
CN110838295A (en) * 2019-11-17 2020-02-25 西北工业大学 Model generation method, voiceprint recognition method and corresponding device
WO2020073694A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Voiceprint identification method, model training method and server
CN111524526A (en) * 2020-05-14 2020-08-11 中国工商银行股份有限公司 Voiceprint recognition method and device
CN112259105A (en) * 2020-10-10 2021-01-22 西南政法大学 Training method of voiceprint recognition model, storage medium and computer equipment
CN112820299A (en) * 2020-12-29 2021-05-18 马上消费金融股份有限公司 Voiceprint recognition model training method and device and related equipment

Also Published As

Publication number Publication date
CN113421573A (en) 2021-09-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant