CN112382298B - Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof - Google Patents

Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof

Info

Publication number
CN112382298B
CN112382298B (application number CN202011282426.2A)
Authority
CN
China
Prior art keywords
voice
recognition
layer
series
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011282426.2A
Other languages
Chinese (zh)
Other versions
CN112382298A (en)
Inventor
欧阳鹏
刘玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingwei Intelligent Technology Co ltd
Original Assignee
Beijing Qingwei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingwei Intelligent Technology Co ltd filed Critical Beijing Qingwei Intelligent Technology Co ltd
Priority to CN202011282426.2A priority Critical patent/CN112382298B/en
Publication of CN112382298A publication Critical patent/CN112382298A/en
Application granted granted Critical
Publication of CN112382298B publication Critical patent/CN112382298B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 - Training, enrolment or model building
    • G10L 17/06 - Decision making techniques; Pattern matching strategies
    • G10L 17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/18 - Artificial neural networks; Connectionist approaches

Abstract

The invention provides a training method for a wake-up word voiceprint recognition model, which comprises the following steps: extracting a forward speech feature series and a backward speech feature series; establishing a first embedded-layer recognition network Ns; establishing a second embedded-layer recognition network Nt; obtaining the joint speech embedding-layer output features; establishing a combined network Nc; and adjusting and training the first embedded-layer recognition network Ns, the second embedded-layer recognition network Nt and the combined network Nc. The invention solves the problem that large amounts of text-dependent data must be collected for training: it provides a text-dependent voiceprint recognition system based on wake-up words, and its deep-learning-based multi-task recognition approach compensates for the loss of text information in conventional voiceprint recognition. The invention also provides a wake-up word voiceprint recognition method and a wake-up word voiceprint recognition model.

Description

Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof
Technical Field
The present invention relates to the field of speech recognition. The method is applied to recognition and processing of wake-up words. The invention particularly relates to a wake-up word voiceprint recognition method, a wake-up word voiceprint recognition model and a training method thereof.
Background
Voiceprint recognition distinguishes unknown speakers by analysing the features of one or more speech signals. It is widely applied in criminal investigation, crime tracking and personalized applications, and is also commonly used for identity cards, credit cards, banking transactions, voice-controlled locks and the like.
Text-independent speaker recognition is typically trained on large amounts of unconstrained speech data, in which the text information is implicitly normalized away. This benefits recognition, because a text-independent system must be insensitive to the variability of the spoken text. For text-dependent voiceprint recognition, however, this is a fatal problem, because text information is very important in text-dependent recognition; applying the same kind of model to text-dependent voiceprint recognition therefore gives unsatisfactory results.
The existing solution is to collect more text-dependent data for training, which requires gathering massive amounts of data and is impractical for a large and diverse user base. Today many smart devices coexist, each with its own wake-up word, for example "open the music" for one device and "turn on the speaker" for another. If the same person sets two different wake-up words, conventional voiceprint recognition is easily confused and the device is falsely woken: because the text information is implicitly normalized in text-dependent recognition, the variability of the text cannot be removed, and the loss of text information easily leads to confusion in voiceprint recognition. The existing remedy is to collect massive amounts of data and train for each specific text.
Disclosure of Invention
The object of the invention is to provide a training method for a wake-up word voiceprint recognition model. It solves the prior-art problem of having to collect large amounts of text-dependent data for training, provides a text-dependent voiceprint recognition system based on wake-up words, and compensates for the loss of text information in conventional voiceprint recognition through a deep-learning-based multi-task recognition approach.
One aspect of the present invention provides a training method for a wake word voiceprint recognition model, including:
step S101, intercepting voice training data according to the forward sequence of the playing frames and the set intercepting frame length, and acquiring a forward voice characteristic series. And intercepting voice training data according to the reverse sequence of the playing frames and the set intercepting frame length to acquire a backward voice characteristic series. The speech training data includes segments of speech training data having wake-up word speech.
The series of forward speech features includes a plurality of sequential and orderly arranged forward speech feature units. The series of backward speech features includes a plurality of sequential and orderly arranged backward speech feature units.
Step S102: establish a first embedded-layer recognition network and feed it the forward speech feature series and the backward speech feature series. The first embedded-layer recognition network comprises a speaker embedding extractor, a first pooling layer and a first fully connected layer.
The speaker embedding extractor processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the speaker feature recognition weight values of the voice training data.
The first pooling layer pools the speaker feature recognition weight values.
The first fully connected layer fully connects the pooled speaker feature recognition weight values to obtain the speaker embedding-layer output feature value.
Step S103: establish a second embedded-layer recognition network and feed it the forward speech feature series and the backward speech feature series. The second embedded-layer recognition network comprises a text embedding extractor, a second pooling layer and a second fully connected layer.
The text embedding extractor processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the text feature recognition weight values of the voice training data.
The second pooling layer pools the text feature recognition weight values.
The second fully connected layer fully connects the pooled text feature recognition weight values to obtain the text embedding-layer output feature value.
Step S104: combine the speaker embedding-layer output feature value and the text embedding-layer output feature value to obtain the joint speech embedding-layer output features.
Step S105: establish a combined network. The combined network fully connects the joint speech embedding-layer output features to obtain the current joint recognition weight value.
Step S106: judge whether the current joint recognition weight value equals the set training weight value. If so, retain the control parameters in the first embedded-layer recognition network, the second embedded-layer recognition network and the combined network. If not, fully connect the current joint recognition weight value to obtain speaker classification weight information and text classification weight information, and adjust the first embedded-layer recognition network and the second embedded-layer recognition network according to the speaker classification weight information and the text classification weight information, respectively.
A second aspect of the invention provides a wake-up word voiceprint recognition model capable of recognizing voice recognition data. The voice recognition data is sliced in the forward order of the playback frames with a set slice frame length to obtain a forward speech feature series, and sliced in the reverse order of the playback frames with the set slice frame length to obtain a backward speech feature series. The voice recognition data comprises multiple segments of voice recognition data containing wake-up word speech.
The forward speech feature series comprises a plurality of consecutive, ordered forward speech feature units. The backward speech feature series comprises a plurality of consecutive, ordered backward speech feature units.
The wake word voiceprint recognition model comprises: a first embedded layer identification network, a second embedded layer identification network, and a combined network.
The first embedded-layer recognition network receives the forward speech feature series and the backward speech feature series and comprises:
a speaker embedding extractor, which processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the speaker feature recognition weight values of the voice recognition data;
a first pooling layer, which pools the speaker feature recognition weight values; and
a first fully connected layer, which fully connects the pooled speaker feature recognition weight values to obtain the speaker embedding-layer output feature value.
The second embedded-layer recognition network receives the forward speech feature series and the backward speech feature series and comprises:
a text embedding extractor, which processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the text feature recognition weight values of the voice recognition data;
a second pooling layer, which pools the text feature recognition weight values; and
a second fully connected layer, which fully connects the pooled text feature recognition weight values to obtain the text embedding-layer output feature value.
The speaker embedding-layer output feature value and the text embedding-layer output feature value are combined to obtain the joint speech embedding-layer output features.
The combined network fully connects the joint speech embedding-layer output features to obtain the current joint recognition weight value, and obtains the wake-up word recognition result information according to the current joint recognition weight value.
A third aspect of the invention provides a wake-up word voiceprint recognition method, which comprises:
Step 201: acquire voice recognition data.
Step 202: obtain wake-up word recognition result information by recognizing the voice recognition data with the wake-up word voiceprint recognition model of the invention.
The characteristics, technical features, advantages and implementations of the wake-up word voiceprint recognition method, the wake-up word voiceprint recognition model and their training method are further described below, in a clear and easily understood manner, with reference to the drawings.
Drawings
Fig. 1 is a flow chart for explaining a training method of a wake word voiceprint recognition model in one embodiment of the present invention.
FIG. 2 is a diagram illustrating the composition and processing of wake word voiceprint recognition models in one embodiment of the present invention.
Fig. 3 is a schematic diagram for explaining the extraction of the series of forward and backward speech features in one embodiment of the present invention.
Fig. 4 is a flowchart for explaining a wake word voiceprint recognition method in still another embodiment of the present invention.
Detailed Description
For a clearer understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described with reference to the drawings, in which like reference numerals refer to identical or structurally similar but functionally identical components throughout the separate views.
In this document, "schematic" means "serving as an example, instance, or illustration", and any illustration or embodiment described herein as "schematic" should not be construed as a more preferred or advantageous solution. For simplicity, the drawings only schematically show the parts relevant to the exemplary embodiments and do not represent the actual structure or actual proportions of a product.
In one aspect, the invention provides a training method for a wake-up word voiceprint recognition model, as shown in figs. 1, 2 and 3, comprising the following steps:
step S101, a forward voice feature series and a backward voice feature series are extracted.
In this step, as shown in fig. 3, the data segment is segmented according to the principle of the BLSTM, and the voice training data is intercepted according to the forward sequence of the play frame and the set intercepting frame length, so as to obtain the forward voice feature series. And intercepting voice training data according to the reverse sequence of the playing frames and the set intercepting frame length to acquire a backward voice characteristic series. The speech training data includes segments of speech training data having wake-up word speech.
The series of forward speech features includes a plurality of sequential and orderly arranged forward speech feature units. The series of backward speech features includes a plurality of sequential and orderly arranged backward speech feature units.
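To make the slicing concrete, below is a minimal Python/NumPy sketch, assuming the speech has already been converted into a frame-level feature matrix X of shape (num_frames, feat_dim), for example FBank or MFCC frames; the function name extract_bidirectional_series and the window length k are illustrative and are not taken from the patent.

```python
import numpy as np

def extract_bidirectional_series(X: np.ndarray, k: int):
    """Slice a frame-level feature matrix into forward and backward
    feature units of k + 1 consecutive frames each (cf. formula 1,
    F_i = X[j : j + k + 1]). X has shape (num_frames, feat_dim)."""
    num_frames = X.shape[0]
    # Forward series: slide a window over the frames in playback order.
    forward = [X[j:j + k + 1] for j in range(num_frames - k)]
    # Backward series: the same sliding window over the frames in reverse order.
    X_rev = X[::-1]
    backward = [X_rev[j:j + k + 1] for j in range(num_frames - k)]
    return forward, backward

# Toy usage: 100 frames of 40-dimensional features, 20-frame feature units.
X = np.random.randn(100, 40).astype(np.float32)
fwd, bwd = extract_bidirectional_series(X, k=19)
print(len(fwd), fwd[0].shape)  # 81 (20, 40)
```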
Step S102: establish a first embedded-layer recognition network Ns.
In this step, as shown in fig. 2, the first embedded-layer recognition network Ns is established and receives the forward speech feature series and the backward speech feature series. The first embedded-layer recognition network Ns comprises a speaker embedding extractor 101, a first pooling layer 102 and a first fully connected layer 103.
The speaker embedding extractor 101 processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the speaker feature recognition weight values of the voice training data.
The first pooling layer 102 pools the speaker feature recognition weight values.
The first fully connected layer 103 fully connects the pooled speaker feature recognition weight values to obtain the speaker embedding-layer output feature value.
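A minimal PyTorch sketch of one embedded-layer recognition branch follows, assuming the "first two layers of the TDNN" are realised as 1-D convolutions over the frame axis and that the pooling layer is mean pooling over time; the class name EmbeddingBranch, the layer widths and the default dimensions are illustrative assumptions, and for brevity the sketch processes a single feature sequence, whereas the patent feeds both the forward and the backward series to the branch. The second embedded-layer recognition network Nt described in step S103 below has the same structure and differs only in the labels it is trained against.

```python
import torch
import torch.nn as nn

class EmbeddingBranch(nn.Module):
    """One embedded-layer recognition branch (usable for both Ns and Nt):
    two TDNN-style layers -> pooling over time -> fully connected
    embedding layer -> classification layer."""

    def __init__(self, feat_dim: int = 40, emb_dim: int = 128, num_classes: int = 100):
        super().__init__()
        # "Input layer and hidden layer of the TDNN": two dilated 1-D convolutions.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
        )
        self.embedding = nn.Linear(256, emb_dim)            # first/second fully connected layer
        self.classifier = nn.Linear(emb_dim, num_classes)   # classification fully connected layer

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, feat_dim) -> (batch, feat_dim, frames) for Conv1d.
        h = self.tdnn(x.transpose(1, 2))
        pooled = h.mean(dim=2)           # pooling layer (mean over the frame axis)
        emb = self.embedding(pooled)     # embedding-layer output feature value
        logits = self.classifier(emb)    # embedding-layer classification output
        return emb, logits
```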
Step S103: establish a second embedded-layer recognition network Nt.
In this step, as shown in fig. 2, the second embedded-layer recognition network Nt is established and receives the forward speech feature series and the backward speech feature series. The second embedded-layer recognition network Nt comprises a text embedding extractor 201, a second pooling layer 202 and a second fully connected layer 203.
The text embedding extractor 201 processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the text feature recognition weight values of the voice training data.
The second pooling layer 202 pools the text feature recognition weight values.
The second fully connected layer 203 fully connects the pooled text feature recognition weight values to obtain the text embedding-layer output feature value.
Step S104: obtain the joint speech embedding-layer output features.
In this step, as shown in fig. 2, the speaker embedding-layer output feature value and the text embedding-layer output feature value are combined to obtain the joint speech embedding-layer output features.
Step S105: establish a combined network Nc.
In this step, as shown in figs. 2 and 3, the combined network Nc is established. The combined network Nc fully connects the joint speech embedding-layer output features to obtain the current joint recognition weight value; this full-connection operation is implemented by the first combined fully connected layer 301.
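A minimal sketch of the joint embedding and the combined network Nc, assuming the joint speech embedding is a simple concatenation of the speaker and text embeddings and that Nc consists of a first combined fully connected layer followed by speaker and text classification heads; the class name CombinedNetwork and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CombinedNetwork(nn.Module):
    """Combined network Nc: fully connects the concatenated
    [speaker embedding, text embedding] and produces the speaker and
    text classification outputs used to adjust Ns and Nt."""

    def __init__(self, emb_dim: int = 128, num_speakers: int = 100, num_texts: int = 10):
        super().__init__()
        self.joint_fc = nn.Linear(2 * emb_dim, 256)        # first combined fully connected layer
        self.speaker_head = nn.Linear(256, num_speakers)   # second combined fully connected layer,
        self.text_head = nn.Linear(256, num_texts)         # split into speaker / text outputs

    def forward(self, spk_emb: torch.Tensor, txt_emb: torch.Tensor):
        joint = torch.cat([spk_emb, txt_emb], dim=-1)   # joint speech embedding-layer output features
        weights = torch.relu(self.joint_fc(joint))      # current joint recognition weight value
        return weights, self.speaker_head(weights), self.text_head(weights)
```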
Step S106: adjust and train the first embedded-layer recognition network Ns, the second embedded-layer recognition network Nt and the combined network Nc.
In this step, it is judged whether the current joint recognition weight value equals the set training weight value. If so, the control parameters in the first embedded-layer recognition network Ns, the second embedded-layer recognition network Nt and the combined network Nc are retained. If not, the current joint recognition weight value is fully connected through the second combined fully connected layer 302 to obtain speaker classification weight information and text classification weight information, and the first embedded-layer recognition network Ns and the second embedded-layer recognition network Nt are adjusted according to the speaker classification weight information and the text classification weight information, respectively.
As shown in fig. 2, in an embodiment of the training method of the wake-up word voiceprint recognition model of the invention, the first embedded-layer recognition network Ns further comprises:
a first classification fully connected layer 104, which fully connects the speaker embedding-layer output feature value to obtain the embedding-layer output speaker classification weight information.
In another embodiment of the training method of the wake-up word voiceprint recognition model of the invention, the second embedded-layer recognition network Nt further comprises:
a second classification fully connected layer 204, which fully connects the text embedding-layer output feature value to obtain the embedding-layer output text classification information.
In yet another embodiment of the training method of the wake-up word voiceprint recognition model of the invention, step S106 further comprises:
judging whether the embedding-layer output speaker classification weight information is smaller than the speaker classification weight information obtained in step S106; if so, the control parameters in the combined network Nc are adjusted, and if not, the control parameters of the first embedded-layer recognition network Ns are adjusted;
judging whether the embedding-layer output text classification information is smaller than the text classification weight information obtained in step S106; if so, the control parameters in the combined network Nc are adjusted, and if not, the control parameters of the second embedded-layer recognition network Nt are adjusted.
This reduces the training time for multiple models and makes it easy to quickly adjust the various weights and control parameters in the model.
A second aspect of the invention provides a wake-up word voiceprint recognition model, as shown in fig. 2, which is capable of recognizing voice recognition data. The voice recognition data is sliced in the forward order of the playback frames with a set slice frame length to obtain a forward speech feature series, and sliced in the reverse order of the playback frames with the set slice frame length to obtain a backward speech feature series. The voice recognition data comprises multiple segments of voice recognition data containing wake-up word speech.
The forward speech feature series comprises a plurality of consecutive, ordered forward speech feature units. The backward speech feature series comprises a plurality of consecutive, ordered backward speech feature units.
The wake-up word voiceprint recognition model comprises a first embedded-layer recognition network Ns, a second embedded-layer recognition network Nt and a combined network Nc.
The first embedded-layer recognition network Ns receives the forward speech feature series and the backward speech feature series and comprises:
a speaker embedding extractor 101, which processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the speaker feature recognition weight values of the voice recognition data;
a first pooling layer 102, which pools the speaker feature recognition weight values; and
a first fully connected layer 103, which fully connects the pooled speaker feature recognition weight values to obtain the speaker embedding-layer output feature value.
The second embedded-layer recognition network Nt receives the forward speech feature series and the backward speech feature series and comprises:
a text embedding extractor 201, which processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the text feature recognition weight values of the voice recognition data;
a second pooling layer 202, which pools the text feature recognition weight values; and
a second fully connected layer 203, which fully connects the pooled text feature recognition weight values to obtain the text embedding-layer output feature value.
The speaker embedding-layer output feature value and the text embedding-layer output feature value are combined to obtain the joint speech embedding-layer output features.
The combined network Nc fully connects the joint speech embedding-layer output features to obtain the current joint recognition weight value, and obtains the wake-up word recognition result information according to the current joint recognition weight value.
In another embodiment of the wake-up word voiceprint recognition model of the invention, the combined network Nc judges whether the current joint recognition weight value lies within a set range; if so, wake-up word recognition pass information is output, and if not, wake-up word recognition failure information is output.
A third aspect of the invention provides a wake-up word voiceprint recognition method, as shown in fig. 4, which comprises:
Step 201: acquire voice recognition data.
Step 202: obtain wake-up word recognition result information.
In this step, the wake-up word voiceprint recognition model is used to recognize the voice recognition data and obtain the wake-up word recognition result information.
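An end-to-end usage sketch of steps 201 and 202 follows. It reuses the illustrative EmbeddingBranch and CombinedNetwork classes from the earlier sketches (so it is not runnable without them), and the threshold-based decision is only a crude stand-in for judging whether the current joint recognition weight value is in the set range; the function name, the scoring rule and the threshold value are assumptions, not values given in the patent.

```python
import numpy as np
import torch

# Hypothetical pre-trained components; see the EmbeddingBranch and
# CombinedNetwork sketches earlier in this description.
ns = EmbeddingBranch()   # first embedded-layer recognition network Ns (speaker)
nt = EmbeddingBranch()   # second embedded-layer recognition network Nt (text)
nc = CombinedNetwork()   # combined network Nc

def recognize_wake_word(features: np.ndarray, threshold: float = 0.5) -> bool:
    """Step 201/202: take frame-level features of one utterance and
    return whether the wake-up word (and speaker) is recognized."""
    x = torch.from_numpy(features).unsqueeze(0)  # (1, frames, feat_dim)
    with torch.no_grad():
        spk_emb, _ = ns(x)
        txt_emb, _ = nt(x)
        _, spk_logits, txt_logits = nc(spk_emb, txt_emb)
        # Crude stand-in for "the joint recognition weight value is in the
        # set range": threshold a combined confidence score (assumption).
        score = torch.sigmoid(spk_logits.max() + txt_logits.max()).item()
    return score >= threshold

print(recognize_wake_word(np.random.randn(100, 40).astype(np.float32)))
```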
The invention solves the problem of the existing approach, which must gather large amounts of text for training: it provides a text-dependent voiceprint recognition system based on wake-up words, and its deep-learning-based multi-task recognition implementation compensates for the loss of text information in conventional voiceprint recognition.
The invention proposes bidirectional feature segments. Following the principle of a BLSTM, the data segment is split into a forward part F and a backward part B for feature extraction. Combining the forward and backward directions compensates for the loss of temporal order and helps with the timing problem of non-recurrent neural networks: a recurrent neural network (RNN) can model the dependencies between earlier and later parts of sequence data, and the output of a BLSTM is obtained by considering both the preceding and the following context, which makes the method robust. This feature extraction technique can therefore solve the timing problem of other non-recurrent network structures; the feature extraction process is described in detail in the specific implementation.
The wake-up-word-based text-dependent voiceprint recognition system can train text-dependent voiceprint recognition on text-independent data without massive amounts of wake-up word data, and different wake-up words for each person do not require retraining; voiceprint recognition remains effective.
The invention addresses the loss of text information and the mismatch of text information between training and verification data. The usual algorithm is to train on massive amounts of text-dependent data, and the recognition performance can differ greatly from text to text.
In one embodiment of the invention, features are first extracted from the shared data in the forward and backward directions respectively (the detailed extraction process is shown in fig. 3), and the features extracted in the two directions are then combined according to formula 1:
$$F_i = X[j:j+k+1] = [X_j, X_{j+1}, \dots, X_{j+k}] \qquad \text{(formula 1)}$$
Training is then performed according to the deep-learning model structure shown in fig. 2. Specifically, the bidirectional data features obtained in the first step are fed into two substructures (Ns, Nt) with identical architecture: the embedding extractor consists of the first two layers of a TDNN, followed by a pooling layer (the first or second pooling layer), followed by a dense layer (the first or second fully connected layer) whose output is the embedding; the final dense layer (the first or second classification fully connected layer) is a classification layer.
The speaker embedding (spk embedding) is extracted from substructure Ns and the text embedding output feature value from substructure Nt; the two are combined into the joint embedding (combine embedding), and the combined current joint recognition weight value is input into the substructure Nc. The loss function is computed as shown in formula 2:
$$
\begin{aligned}
L_{s1} &= \mathrm{CE}\big(N_s(X_s),\, y_s\big) \\
L_{t1} &= \mathrm{KLD}\big(N_t(X_t),\, y_t\big) \\
L_{s2} &= \mathrm{CE}\big(N_c([\mathrm{ebd}_s, \mathrm{ebd}_t]),\, y_s\big) \\
L_{t2} &= \mathrm{CE}\big(N_c([\mathrm{ebd}_s, \mathrm{ebd}_t]),\, y_t\big) \\
L_{all} &= L_{s1} + L_{t1} + L_{s2} + L_{t2}
\end{aligned}
\qquad \text{(formula 2)}
$$
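A minimal PyTorch rendering of formula 2 is sketched below, assuming y_s holds integer speaker labels and y_t_soft holds soft (probability) text labels for the KLD term, with the cross-entropy term on the text labels taken over their argmax; the function name and tensor shapes are assumptions. In practice, all four terms are summed into L_all and back-propagated through Ns, Nt and Nc jointly.

```python
import torch
import torch.nn.functional as F

def total_loss(spk_logits_ns, txt_logits_nt, spk_logits_nc, txt_logits_nc,
               y_s, y_t_soft):
    """L_all = L_s1 + L_t1 + L_s2 + L_t2 (formula 2).
    y_s: integer speaker labels, shape (batch,).
    y_t_soft: soft text labels (probabilities), shape (batch, num_texts)."""
    l_s1 = F.cross_entropy(spk_logits_ns, y_s)                      # CE(Ns(Xs), ys)
    l_t1 = F.kl_div(F.log_softmax(txt_logits_nt, dim=-1), y_t_soft,
                    reduction="batchmean")                          # KLD(Nt(Xt), yt)
    l_s2 = F.cross_entropy(spk_logits_nc, y_s)                      # CE(Nc([ebd_s, ebd_t]), ys)
    l_t2 = F.cross_entropy(txt_logits_nc, y_t_soft.argmax(dim=-1))  # CE(Nc([ebd_s, ebd_t]), yt)
    return l_s1 + l_t1 + l_s2 + l_t2
```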
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted for clarity only, and the disclosure is not limited to the embodiments shown and described herein, since the embodiments may be suitably combined by those of ordinary skill in the art.
The above list of detailed descriptions is only specific to practical embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent embodiments or modifications that do not depart from the spirit of the present invention should be included in the scope of the present invention.

Claims (7)

1. A training method of a wake-up word voiceprint recognition model, characterized by comprising the following steps:
Step S101, intercepting voice training data according to the forward sequence of the playing frames and the set intercepting frame length to acquire a forward voice characteristic series; intercepting voice training data according to the reverse sequence of the playing frames and the length of the set intercepting frames to acquire a backward voice characteristic series; the voice training data comprises a plurality of pieces of voice training data with wake-up word voice;
the forward voice characteristic series comprises a plurality of continuous and orderly arranged forward voice characteristic units; the backward voice characteristic series comprises a plurality of backward voice characteristic units which are continuously and orderly arranged;
step S102, a first embedded layer recognition network is established, and the forward voice characteristic series and the backward voice characteristic series are received; the first embedded layer identification network comprises:
a speaker embedded extractor comprising an input layer and hidden layer structure in the TDNN time delay neural network; the speaker embedded extractor recognizes the forward voice feature series and the backward voice feature series through an input layer and a hidden layer in the TDNN time delay neural network to obtain speaker feature recognition weight values of voice training data;
a first pooling layer that pools the speaker characteristic recognition weight values;
a first full-connection layer which is fully connected with the pooled speaker characteristic recognition weight value to acquire an output characteristic value of the speaker embedded layer;
step S103, a second embedded layer recognition network is established and the forward voice characteristic series and the backward voice characteristic series are received; the second embedded layer identification network comprises:
a text embedding extractor which recognizes the forward voice feature series and the backward voice feature series through an input layer and a hidden layer in the TDNN time delay neural network to obtain text feature recognition weight values of voice training data;
a second pooling layer that pools the text feature recognition weight values;
the second full-connection layer is fully connected with the pooled text feature recognition weight value to acquire a text embedding layer output feature value;
step S104, combining the speaker embedded layer output characteristic value and the text embedded layer output characteristic value to obtain a combined voice embedded layer output characteristic;
step S105, establishing a combined network; the current joint recognition weight value is obtained through the output characteristics of the joint voice embedded layer which are fully connected through the combined network;
step S106, judging whether the current joint recognition weight value is a set training weight value, if so, reserving control parameters in the first embedded layer recognition network, the second embedded layer recognition network and the combined network, and if not, fully connecting the current joint recognition weight value to acquire speaker classification weight information and text classification weight information; and respectively adjusting the corresponding first embedded layer identification network and the second embedded layer identification network according to the speaker classification weight information and the text classification weight information.
2. The training method of claim 1, wherein the first embedded layer identification network further comprises: and the first classification full-connection layer is fully connected with the speaker embedding layer output characteristic value to acquire the embedding layer output speaker classification weight information.
3. The training method of claim 2, wherein the second embedded layer identification network further comprises: and the second classification full-connection layer is fully connected with the text embedding layer output characteristic value to acquire the embedding layer output text classification information.
4. A training method as claimed in claim 3, characterized in that said step S106 further comprises:
judging whether the speaker classification weight information output by the embedded layer is smaller than the speaker classification weight information, if yes, adjusting control parameters in the combined network; if not, adjusting the control parameters of the first embedded layer identification network;
judging whether the text classification information output by the embedded layer is smaller than text classification weight information, if so, adjusting control parameters in the combined network; if not, adjusting the control parameters of the second embedded layer identification network.
5. The wake-up word voiceprint recognition model is characterized by being capable of recognizing voice recognition data; intercepting voice recognition data according to the forward sequence of the playing frames and the length of the set intercepting frames to acquire a forward voice characteristic series; intercepting voice recognition data according to the reverse sequence of the playing frames and the length of the set intercepting frames to acquire a backward voice characteristic series; the voice recognition data comprises a plurality of pieces of voice recognition data with wake-up word voice;
the forward voice characteristic series comprises a plurality of continuous and orderly arranged forward voice characteristic units; the backward voice characteristic series comprises a plurality of backward voice characteristic units which are continuously and orderly arranged;
the wake-up word voiceprint recognition model comprises:
a first embedded layer identification network that receives the series of forward speech features and the series of backward speech features; the first embedded layer identification network comprises:
a speaker embedded extractor which recognizes the forward voice feature series and the backward voice feature series through an input layer and a hidden layer in the TDNN time delay neural network to obtain speaker feature recognition weight values of voice recognition data;
a first pooling layer that pools the speaker characteristic recognition weight values; and
a first full-connection layer which is fully connected with the pooled speaker characteristic recognition weight value to acquire an output characteristic value of the speaker embedded layer;
a second embedded layer identification network that receives the series of forward speech features and the series of backward speech features; the second embedded layer identification network comprises:
a text embedding extractor which recognizes the forward voice feature series and the backward voice feature series through an input layer and a hidden layer in the TDNN time delay neural network to obtain text feature recognition weight values of voice recognition data;
a second pooling layer that pools the text feature recognition weight values; and
the second full-connection layer is fully connected with the pooled text feature recognition weight value to acquire a text embedding layer output feature value;
combining the speaker embedded layer output characteristic value and the text embedded layer output characteristic value to obtain a combined voice embedded layer output characteristic; and
the combined network is fully connected with the output characteristics of the combined voice embedded layer to acquire a current combined recognition weight value; and acquiring wake-up word recognition result information according to the current joint recognition weight value.
6. The wake word voiceprint recognition model of claim 5, comprising in the combined network: judging whether the current joint recognition weight value is in a set range, if so, outputting wake-up word recognition passing information; if not, outputting wake-up word recognition failure information.
7. The wake-up word voiceprint recognition method is characterized by comprising the following steps of:
step 201, obtaining voice recognition data;
step 202, obtaining wake-up word recognition result information by recognizing the voice recognition data through the wake-up word voiceprint recognition model according to any one of claims 5 to 6.
CN202011282426.2A 2020-11-17 2020-11-17 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof Active CN112382298B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011282426.2A CN112382298B (en) 2020-11-17 2020-11-17 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011282426.2A CN112382298B (en) 2020-11-17 2020-11-17 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof

Publications (2)

Publication Number Publication Date
CN112382298A CN112382298A (en) 2021-02-19
CN112382298B true CN112382298B (en) 2024-03-08

Family

ID=74584847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011282426.2A Active CN112382298B (en) 2020-11-17 2020-11-17 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof

Country Status (1)

Country Link
CN (1) CN112382298B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360522B (en) * 2022-03-09 2022-08-02 深圳市友杰智新科技有限公司 Training method of voice awakening model, and detection method and equipment of voice false awakening

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
CN108648759A (en) * 2018-05-14 2018-10-12 华南理工大学 A kind of method for recognizing sound-groove that text is unrelated
CN108877812A (en) * 2018-08-16 2018-11-23 桂林电子科技大学 A kind of method for recognizing sound-groove, device and storage medium
CN110120223A (en) * 2019-04-22 2019-08-13 南京硅基智能科技有限公司 A kind of method for recognizing sound-groove based on time-delay neural network TDNN
CN110570870A (en) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 Text-independent voiceprint recognition method, device and equipment
CN110634491A (en) * 2019-10-23 2019-12-31 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN110767231A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Voice control equipment awakening word identification method and device based on time delay neural network
CN111243604A (en) * 2020-01-13 2020-06-05 苏州思必驰信息科技有限公司 Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN111783939A (en) * 2020-05-28 2020-10-16 厦门快商通科技股份有限公司 Voiceprint recognition model training method and device, mobile terminal and storage medium


Also Published As

Publication number Publication date
CN112382298A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
Gomez-Alanis et al. A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization based
US6219639B1 (en) Method and apparatus for recognizing identity of individuals employing synchronized biometrics
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN110299142A (en) A kind of method for recognizing sound-groove and device based on the network integration
Wang et al. A network model of speaker identification with new feature extraction methods and asymmetric BLSTM
Khdier et al. Deep learning algorithms based voiceprint recognition system in noisy environment
Soleymani et al. Prosodic-enhanced siamese convolutional neural networks for cross-device text-independent speaker verification
CN110299132A (en) A kind of speech digit recognition methods and device
Sukhwal et al. Comparative study of different classifiers based speaker recognition system using modified MFCC for noisy environment
CN112382298B (en) Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof
Beigi Challenges of LargeScale Speaker Recognition
CN102496366B (en) Speaker identification method irrelevant with text
CN100570712C (en) Based on anchor model space projection ordinal number quick method for identifying speaker relatively
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
CN111091840A (en) Method for establishing gender identification model and gender identification method
WO1995005656A1 (en) A speaker verification system
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
Wang et al. Capture interspeaker information with a neural network for speaker identification
CN110738985A (en) Cross-modal biometric feature recognition method and system based on voice signals
Novakovic Speaker identification in smart environments with multilayer perceptron
Naveen et al. Speaker Identification and Verification using Deep Learning
Roy et al. Speaker recognition using multimodal biometric system
Chetty et al. Biometric person authentication with liveness detection based on audio-visual fusion
Habib et al. SpeakerNet for Cross-lingual Text-Independent Speaker Verification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant