CN112382298B - Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof - Google Patents

Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof

Info

Publication number
CN112382298B
CN112382298B (application number CN202011282426.2A)
Authority
CN
China
Prior art keywords
voice
recognition
layer
series
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011282426.2A
Other languages
Chinese (zh)
Other versions
CN112382298A (en)
Inventor
欧阳鹏
刘玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingwei Intelligent Technology Co ltd
Original Assignee
Beijing Qingwei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingwei Intelligent Technology Co ltd filed Critical Beijing Qingwei Intelligent Technology Co ltd
Priority to CN202011282426.2A priority Critical patent/CN112382298B/en
Publication of CN112382298A publication Critical patent/CN112382298A/en
Application granted granted Critical
Publication of CN112382298B publication Critical patent/CN112382298B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 - Training, enrolment or model building
    • G10L 17/06 - Decision making techniques; Pattern matching strategies
    • G10L 17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/18 - Artificial neural networks; Connectionist approaches

Abstract

The invention provides a training method for a wake-up word voiceprint recognition model, which comprises the following steps: extracting a forward speech feature series and a backward speech feature series; establishing a first embedded-layer recognition network Ns; establishing a second embedded-layer recognition network Nt; obtaining the joint speech embedding-layer output features; establishing a combined network Nc; and adjusting and training the first embedded-layer recognition network Ns, the second embedded-layer recognition network Nt and the combined network Nc. The invention solves the problem that large amounts of text-dependent data must be collected for training: it provides a text-dependent voiceprint recognition system based on wake-up words, and its deep-learning-based multi-task recognition approach compensates for the loss of text information in conventional voiceprint recognition. The invention also provides a wake-up word voiceprint recognition method and a wake-up word voiceprint recognition model.

Description

Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof
Technical Field
The present invention relates to the field of speech recognition. The method is applied to recognition and processing of wake-up words. The invention particularly relates to a wake-up word voiceprint recognition method, a wake-up word voiceprint recognition model and a training method thereof.
Background
Voiceprint recognition distinguishes unknown speakers by analysing the features of one or more speech signals. It is widely applied in criminal investigation, crime tracking and personalized applications, and is also commonly used for identity cards, credit cards, banking transactions, voice-controlled locks and the like.
Text-independent speaker recognition is typically trained on large amounts of unconstrained speech data, in which the text information is implicitly normalized away. This benefits recognition, because a text-independent system must be insensitive to the variability of the spoken text. For text-dependent voiceprint recognition, however, this is a fatal problem, because text information is very important in text-dependent recognition; applying the same kind of model to text-dependent voiceprint recognition therefore gives unsatisfactory results.
The existing solution is to collect more text-dependent data for training, which requires gathering massive amounts of data and is impractical for a large and diverse user base. Today many smart devices coexist, each with its own wake-up word, for example "open the music" for one device and "turn on the speaker" for another. If the same person sets two different wake-up words, conventional voiceprint recognition is easily confused and the device is falsely woken: because the text information is implicitly normalized in text-dependent recognition, the variability of the text cannot be removed, and the loss of text information easily leads to confusion in voiceprint recognition. The existing remedy is to collect massive amounts of data and train for each specific text.
Disclosure of Invention
The object of the invention is to provide a training method for a wake-up word voiceprint recognition model. It solves the prior-art problem of having to collect large amounts of text-dependent data for training, provides a text-dependent voiceprint recognition system based on wake-up words, and compensates for the loss of text information in conventional voiceprint recognition through a deep-learning-based multi-task recognition approach.
One aspect of the present invention provides a training method for a wake word voiceprint recognition model, including:
step S101, intercepting voice training data according to the forward sequence of the playing frames and the set intercepting frame length, and acquiring a forward voice characteristic series. And intercepting voice training data according to the reverse sequence of the playing frames and the set intercepting frame length to acquire a backward voice characteristic series. The speech training data includes segments of speech training data having wake-up word speech.
The series of forward speech features includes a plurality of sequential and orderly arranged forward speech feature units. The series of backward speech features includes a plurality of sequential and orderly arranged backward speech feature units.
Step S102: establish a first embedded-layer recognition network and feed it the forward speech feature series and the backward speech feature series. The first embedded-layer recognition network comprises a speaker embedding extractor, a first pooling layer and a first fully connected layer.
The speaker embedding extractor processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the speaker feature recognition weight values of the voice training data.
The first pooling layer pools the speaker feature recognition weight values.
The first fully connected layer fully connects the pooled speaker feature recognition weight values to obtain the speaker embedding-layer output feature value.
Step S103: establish a second embedded-layer recognition network and feed it the forward speech feature series and the backward speech feature series. The second embedded-layer recognition network comprises a text embedding extractor, a second pooling layer and a second fully connected layer.
The text embedding extractor processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the text feature recognition weight values of the voice training data.
The second pooling layer pools the text feature recognition weight values.
The second fully connected layer fully connects the pooled text feature recognition weight values to obtain the text embedding-layer output feature value.
Step S104: combine the speaker embedding-layer output feature value and the text embedding-layer output feature value to obtain the joint speech embedding-layer output features.
Step S105: establish a combined network. The combined network fully connects the joint speech embedding-layer output features to obtain the current joint recognition weight value.
Step S106: judge whether the current joint recognition weight value equals the set training weight value. If so, retain the control parameters in the first embedded-layer recognition network, the second embedded-layer recognition network and the combined network. If not, fully connect the current joint recognition weight value to obtain speaker classification weight information and text classification weight information, and adjust the first embedded-layer recognition network and the second embedded-layer recognition network according to the speaker classification weight information and the text classification weight information, respectively.
A second aspect of the invention provides a wake-up word voiceprint recognition model capable of recognizing voice recognition data. The voice recognition data is sliced in the forward order of the playback frames with a set slice frame length to obtain a forward speech feature series, and sliced in the reverse order of the playback frames with the set slice frame length to obtain a backward speech feature series. The voice recognition data comprises multiple segments of voice recognition data containing wake-up word speech.
The forward speech feature series comprises a plurality of consecutive, ordered forward speech feature units. The backward speech feature series comprises a plurality of consecutive, ordered backward speech feature units.
The wake word voiceprint recognition model comprises: a first embedded layer identification network, a second embedded layer identification network, and a combined network.
The first embedded-layer recognition network receives the forward speech feature series and the backward speech feature series and comprises:
a speaker embedding extractor, which processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the speaker feature recognition weight values of the voice recognition data;
a first pooling layer, which pools the speaker feature recognition weight values; and
a first fully connected layer, which fully connects the pooled speaker feature recognition weight values to obtain the speaker embedding-layer output feature value.
The second embedded-layer recognition network receives the forward speech feature series and the backward speech feature series and comprises:
a text embedding extractor, which processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the text feature recognition weight values of the voice recognition data;
a second pooling layer, which pools the text feature recognition weight values; and
a second fully connected layer, which fully connects the pooled text feature recognition weight values to obtain the text embedding-layer output feature value.
The speaker embedding-layer output feature value and the text embedding-layer output feature value are combined to obtain the joint speech embedding-layer output features.
The combined network fully connects the joint speech embedding-layer output features to obtain the current joint recognition weight value, and obtains the wake-up word recognition result information according to the current joint recognition weight value.
A third aspect of the invention provides a wake-up word voiceprint recognition method, which comprises:
Step 201: acquire voice recognition data.
Step 202: obtain wake-up word recognition result information by recognizing the voice recognition data with the wake-up word voiceprint recognition model of the invention.
The characteristics, technical features, advantages and implementations of the wake-up word voiceprint recognition method, the wake-up word voiceprint recognition model and their training method are further described below, in a clear and easily understood manner, with reference to the drawings.
Drawings
Fig. 1 is a flow chart for explaining a training method of a wake word voiceprint recognition model in one embodiment of the present invention.
FIG. 2 is a diagram illustrating the composition and processing of wake word voiceprint recognition models in one embodiment of the present invention.
Fig. 3 is a schematic diagram for explaining the extraction of the series of forward and backward speech features in one embodiment of the present invention.
Fig. 4 is a flowchart for explaining a wake word voiceprint recognition method in still another embodiment of the present invention.
Detailed Description
For a clearer understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described with reference to the drawings, in which like reference numerals refer to identical or structurally similar but functionally identical components throughout the separate views.
In this document, "schematic" means "serving as an example, instance, or illustration", and any illustration or embodiment described herein as "schematic" should not be construed as a more preferred or advantageous solution. For simplicity, the drawings only schematically show the parts relevant to the exemplary embodiments and do not represent the actual structure or actual proportions of a product.
In one aspect, the invention provides a training method for a wake-up word voiceprint recognition model, as shown in figs. 1, 2 and 3, comprising the following steps:
step S101, a forward voice feature series and a backward voice feature series are extracted.
In this step, as shown in fig. 3, the data segment is segmented according to the principle of the BLSTM, and the voice training data is intercepted according to the forward sequence of the play frame and the set intercepting frame length, so as to obtain the forward voice feature series. And intercepting voice training data according to the reverse sequence of the playing frames and the set intercepting frame length to acquire a backward voice characteristic series. The speech training data includes segments of speech training data having wake-up word speech.
The series of forward speech features includes a plurality of sequential and orderly arranged forward speech feature units. The series of backward speech features includes a plurality of sequential and orderly arranged backward speech feature units.
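To make the slicing concrete, below is a minimal Python/NumPy sketch, assuming the speech has already been converted into a frame-level feature matrix X of shape (num_frames, feat_dim), for example FBank or MFCC frames; the function name extract_bidirectional_series and the window length k are illustrative and are not taken from the patent.

```python
import numpy as np

def extract_bidirectional_series(X: np.ndarray, k: int):
    """Slice a frame-level feature matrix into forward and backward
    feature units of k + 1 consecutive frames each (cf. formula 1,
    F_i = X[j : j + k + 1]). X has shape (num_frames, feat_dim)."""
    num_frames = X.shape[0]
    # Forward series: slide a window over the frames in playback order.
    forward = [X[j:j + k + 1] for j in range(num_frames - k)]
    # Backward series: the same sliding window over the frames in reverse order.
    X_rev = X[::-1]
    backward = [X_rev[j:j + k + 1] for j in range(num_frames - k)]
    return forward, backward

# Toy usage: 100 frames of 40-dimensional features, 20-frame feature units.
X = np.random.randn(100, 40).astype(np.float32)
fwd, bwd = extract_bidirectional_series(X, k=19)
print(len(fwd), fwd[0].shape)  # 81 (20, 40)
```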
Step S102: establish a first embedded-layer recognition network Ns.
In this step, as shown in fig. 2, the first embedded-layer recognition network Ns is established and receives the forward speech feature series and the backward speech feature series. The first embedded-layer recognition network Ns comprises a speaker embedding extractor 101, a first pooling layer 102 and a first fully connected layer 103.
The speaker embedding extractor 101 processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the speaker feature recognition weight values of the voice training data.
The first pooling layer 102 pools the speaker feature recognition weight values.
The first fully connected layer 103 fully connects the pooled speaker feature recognition weight values to obtain the speaker embedding-layer output feature value.
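A minimal PyTorch sketch of one embedded-layer recognition branch follows, assuming the "first two layers of the TDNN" are realised as 1-D convolutions over the frame axis and that the pooling layer is mean pooling over time; the class name EmbeddingBranch, the layer widths and the default dimensions are illustrative assumptions, and for brevity the sketch processes a single feature sequence, whereas the patent feeds both the forward and the backward series to the branch. The second embedded-layer recognition network Nt described in step S103 below has the same structure and differs only in the labels it is trained against.

```python
import torch
import torch.nn as nn

class EmbeddingBranch(nn.Module):
    """One embedded-layer recognition branch (usable for both Ns and Nt):
    two TDNN-style layers -> pooling over time -> fully connected
    embedding layer -> classification layer."""

    def __init__(self, feat_dim: int = 40, emb_dim: int = 128, num_classes: int = 100):
        super().__init__()
        # "Input layer and hidden layer of the TDNN": two dilated 1-D convolutions.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
        )
        self.embedding = nn.Linear(256, emb_dim)            # first/second fully connected layer
        self.classifier = nn.Linear(emb_dim, num_classes)   # classification fully connected layer

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, feat_dim) -> (batch, feat_dim, frames) for Conv1d.
        h = self.tdnn(x.transpose(1, 2))
        pooled = h.mean(dim=2)           # pooling layer (mean over the frame axis)
        emb = self.embedding(pooled)     # embedding-layer output feature value
        logits = self.classifier(emb)    # embedding-layer classification output
        return emb, logits
```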
Step S103: establish a second embedded-layer recognition network Nt.
In this step, as shown in fig. 2, the second embedded-layer recognition network Nt is established and receives the forward speech feature series and the backward speech feature series. The second embedded-layer recognition network Nt comprises a text embedding extractor 201, a second pooling layer 202 and a second fully connected layer 203.
The text embedding extractor 201 processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the text feature recognition weight values of the voice training data.
The second pooling layer 202 pools the text feature recognition weight values.
The second fully connected layer 203 fully connects the pooled text feature recognition weight values to obtain the text embedding-layer output feature value.
Step S104: obtain the joint speech embedding-layer output features.
In this step, as shown in fig. 2, the speaker embedding-layer output feature value and the text embedding-layer output feature value are combined to obtain the joint speech embedding-layer output features.
Step S105: establish a combined network Nc.
In this step, as shown in figs. 2 and 3, the combined network Nc is established. The combined network Nc fully connects the joint speech embedding-layer output features to obtain the current joint recognition weight value; this full-connection operation is implemented by the first combined fully connected layer 301.
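A minimal sketch of the joint embedding and the combined network Nc, assuming the joint speech embedding is a simple concatenation of the speaker and text embeddings and that Nc consists of a first combined fully connected layer followed by speaker and text classification heads; the class name CombinedNetwork and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CombinedNetwork(nn.Module):
    """Combined network Nc: fully connects the concatenated
    [speaker embedding, text embedding] and produces the speaker and
    text classification outputs used to adjust Ns and Nt."""

    def __init__(self, emb_dim: int = 128, num_speakers: int = 100, num_texts: int = 10):
        super().__init__()
        self.joint_fc = nn.Linear(2 * emb_dim, 256)        # first combined fully connected layer
        self.speaker_head = nn.Linear(256, num_speakers)   # second combined fully connected layer,
        self.text_head = nn.Linear(256, num_texts)         # split into speaker / text outputs

    def forward(self, spk_emb: torch.Tensor, txt_emb: torch.Tensor):
        joint = torch.cat([spk_emb, txt_emb], dim=-1)   # joint speech embedding-layer output features
        weights = torch.relu(self.joint_fc(joint))      # current joint recognition weight value
        return weights, self.speaker_head(weights), self.text_head(weights)
```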
Step S106: adjust and train the first embedded-layer recognition network Ns, the second embedded-layer recognition network Nt and the combined network Nc.
In this step, it is judged whether the current joint recognition weight value equals the set training weight value. If so, the control parameters in the first embedded-layer recognition network Ns, the second embedded-layer recognition network Nt and the combined network Nc are retained. If not, the current joint recognition weight value is fully connected through the second combined fully connected layer 302 to obtain speaker classification weight information and text classification weight information, and the first embedded-layer recognition network Ns and the second embedded-layer recognition network Nt are adjusted according to the speaker classification weight information and the text classification weight information, respectively.
As shown in fig. 2, in an embodiment of the training method of the wake-up word voiceprint recognition model of the invention, the first embedded-layer recognition network Ns further comprises:
a first classification fully connected layer 104, which fully connects the speaker embedding-layer output feature value to obtain the embedding-layer output speaker classification weight information.
In another embodiment of the training method of the wake-up word voiceprint recognition model of the invention, the second embedded-layer recognition network Nt further comprises:
a second classification fully connected layer 204, which fully connects the text embedding-layer output feature value to obtain the embedding-layer output text classification information.
In yet another embodiment of the training method of the wake-up word voiceprint recognition model of the invention, step S106 further comprises:
judging whether the embedding-layer output speaker classification weight information is smaller than the speaker classification weight information obtained in step S106; if so, the control parameters in the combined network Nc are adjusted, and if not, the control parameters of the first embedded-layer recognition network Ns are adjusted;
judging whether the embedding-layer output text classification information is smaller than the text classification weight information obtained in step S106; if so, the control parameters in the combined network Nc are adjusted, and if not, the control parameters of the second embedded-layer recognition network Nt are adjusted.
This reduces the training time for multiple models and makes it easy to quickly adjust the various weights and control parameters in the model.
A second aspect of the invention provides a wake-up word voiceprint recognition model, as shown in fig. 2, which is capable of recognizing voice recognition data. The voice recognition data is sliced in the forward order of the playback frames with a set slice frame length to obtain a forward speech feature series, and sliced in the reverse order of the playback frames with the set slice frame length to obtain a backward speech feature series. The voice recognition data comprises multiple segments of voice recognition data containing wake-up word speech.
The forward speech feature series comprises a plurality of consecutive, ordered forward speech feature units. The backward speech feature series comprises a plurality of consecutive, ordered backward speech feature units.
The wake-up word voiceprint recognition model comprises a first embedded-layer recognition network Ns, a second embedded-layer recognition network Nt and a combined network Nc.
The first embedded-layer recognition network Ns receives the forward speech feature series and the backward speech feature series and comprises:
a speaker embedding extractor 101, which processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the speaker feature recognition weight values of the voice recognition data;
a first pooling layer 102, which pools the speaker feature recognition weight values; and
a first fully connected layer 103, which fully connects the pooled speaker feature recognition weight values to obtain the speaker embedding-layer output feature value.
The second embedded-layer recognition network Nt receives the forward speech feature series and the backward speech feature series and comprises:
a text embedding extractor 201, which processes the forward speech feature series and the backward speech feature series through the input layer and hidden layer of a TDNN (time-delay neural network) to obtain the text feature recognition weight values of the voice recognition data;
a second pooling layer 202, which pools the text feature recognition weight values; and
a second fully connected layer 203, which fully connects the pooled text feature recognition weight values to obtain the text embedding-layer output feature value.
The speaker embedding-layer output feature value and the text embedding-layer output feature value are combined to obtain the joint speech embedding-layer output features.
The combined network Nc fully connects the joint speech embedding-layer output features to obtain the current joint recognition weight value, and obtains the wake-up word recognition result information according to the current joint recognition weight value.
In another embodiment of the wake-up word voiceprint recognition model of the invention, the combined network Nc judges whether the current joint recognition weight value lies within a set range; if so, wake-up word recognition pass information is output, and if not, wake-up word recognition failure information is output.
A third aspect of the invention provides a wake-up word voiceprint recognition method, as shown in fig. 4, which comprises:
Step 201: acquire voice recognition data.
Step 202: obtain wake-up word recognition result information.
In this step, the wake-up word voiceprint recognition model is used to recognize the voice recognition data and obtain the wake-up word recognition result information.
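An end-to-end usage sketch of steps 201 and 202 follows. It reuses the illustrative EmbeddingBranch and CombinedNetwork classes from the earlier sketches (so it is not runnable without them), and the threshold-based decision is only a crude stand-in for judging whether the current joint recognition weight value is in the set range; the function name, the scoring rule and the threshold value are assumptions, not values given in the patent.

```python
import numpy as np
import torch

# Hypothetical pre-trained components; see the EmbeddingBranch and
# CombinedNetwork sketches earlier in this description.
ns = EmbeddingBranch()   # first embedded-layer recognition network Ns (speaker)
nt = EmbeddingBranch()   # second embedded-layer recognition network Nt (text)
nc = CombinedNetwork()   # combined network Nc

def recognize_wake_word(features: np.ndarray, threshold: float = 0.5) -> bool:
    """Step 201/202: take frame-level features of one utterance and
    return whether the wake-up word (and speaker) is recognized."""
    x = torch.from_numpy(features).unsqueeze(0)  # (1, frames, feat_dim)
    with torch.no_grad():
        spk_emb, _ = ns(x)
        txt_emb, _ = nt(x)
        _, spk_logits, txt_logits = nc(spk_emb, txt_emb)
        # Crude stand-in for "the joint recognition weight value is in the
        # set range": threshold a combined confidence score (assumption).
        score = torch.sigmoid(spk_logits.max() + txt_logits.max()).item()
    return score >= threshold

print(recognize_wake_word(np.random.randn(100, 40).astype(np.float32)))
```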
The invention solves the problem of the existing approach, which must gather large amounts of text for training: it provides a text-dependent voiceprint recognition system based on wake-up words, and its deep-learning-based multi-task recognition implementation compensates for the loss of text information in conventional voiceprint recognition.
The invention proposes bidirectional feature segments. Following the principle of a BLSTM, the data segment is split into a forward part F and a backward part B for feature extraction. Combining the forward and backward directions compensates for the loss of temporal order and helps with the timing problem of non-recurrent neural networks: a recurrent neural network (RNN) can model the dependencies between earlier and later parts of sequence data, and the output of a BLSTM is obtained by considering both the preceding and the following context, which makes the method robust. This feature extraction technique can therefore solve the timing problem of other non-recurrent network structures; the feature extraction process is described in detail in the specific implementation.
The wake-up-word-based text-dependent voiceprint recognition system can train text-dependent voiceprint recognition on text-independent data without massive amounts of wake-up word data, and different wake-up words for each person do not require retraining; voiceprint recognition remains effective.
The invention addresses the loss of text information and the mismatch of text information between training and verification data. The usual algorithm is to train on massive amounts of text-dependent data, and the recognition performance can differ greatly from text to text.
In one embodiment of the invention, features are first extracted from the shared data in the forward and backward directions respectively (the detailed extraction process is shown in fig. 3), and the features extracted in the two directions are then combined according to formula 1:
$$F_i = X[j:j+k+1] = [X_j, X_{j+1}, \dots, X_{j+k}] \qquad \text{(formula 1)}$$
Training is then performed according to the deep-learning model structure shown in fig. 2. Specifically, the bidirectional data features obtained in the first step are fed into two substructures (Ns, Nt) with identical architecture: the embedding extractor consists of the first two layers of a TDNN, followed by a pooling layer (the first or second pooling layer), followed by a dense layer (the first or second fully connected layer) whose output is the embedding; the final dense layer (the first or second classification fully connected layer) is a classification layer.
The speaker embedding (spk embedding) is extracted from substructure Ns and the text embedding output feature value from substructure Nt; the two are combined into the joint embedding (combine embedding), and the combined current joint recognition weight value is input into the substructure Nc. The loss function is computed as shown in formula 2:
$$
\begin{aligned}
L_{s1} &= \mathrm{CE}\big(N_s(X_s),\, y_s\big) \\
L_{t1} &= \mathrm{KLD}\big(N_t(X_t),\, y_t\big) \\
L_{s2} &= \mathrm{CE}\big(N_c([\mathrm{ebd}_s, \mathrm{ebd}_t]),\, y_s\big) \\
L_{t2} &= \mathrm{CE}\big(N_c([\mathrm{ebd}_s, \mathrm{ebd}_t]),\, y_t\big) \\
L_{all} &= L_{s1} + L_{t1} + L_{s2} + L_{t2}
\end{aligned}
\qquad \text{(formula 2)}
$$
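A minimal PyTorch rendering of formula 2 is sketched below, assuming y_s holds integer speaker labels and y_t_soft holds soft (probability) text labels for the KLD term, with the cross-entropy term on the text labels taken over their argmax; the function name and tensor shapes are assumptions. In practice, all four terms are summed into L_all and back-propagated through Ns, Nt and Nc jointly.

```python
import torch
import torch.nn.functional as F

def total_loss(spk_logits_ns, txt_logits_nt, spk_logits_nc, txt_logits_nc,
               y_s, y_t_soft):
    """L_all = L_s1 + L_t1 + L_s2 + L_t2 (formula 2).
    y_s: integer speaker labels, shape (batch,).
    y_t_soft: soft text labels (probabilities), shape (batch, num_texts)."""
    l_s1 = F.cross_entropy(spk_logits_ns, y_s)                      # CE(Ns(Xs), ys)
    l_t1 = F.kl_div(F.log_softmax(txt_logits_nt, dim=-1), y_t_soft,
                    reduction="batchmean")                          # KLD(Nt(Xt), yt)
    l_s2 = F.cross_entropy(spk_logits_nc, y_s)                      # CE(Nc([ebd_s, ebd_t]), ys)
    l_t2 = F.cross_entropy(txt_logits_nc, y_t_soft.argmax(dim=-1))  # CE(Nc([ebd_s, ebd_t]), yt)
    return l_s1 + l_t1 + l_s2 + l_t2
```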
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted for clarity only, and the disclosure is not limited to the embodiments shown and described herein, since the embodiments may be suitably combined by those of ordinary skill in the art.
The above list of detailed descriptions is only specific to practical embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent embodiments or modifications that do not depart from the spirit of the present invention should be included in the scope of the present invention.

Claims (7)

1. A training method of a wake-up word voiceprint recognition model, characterized by comprising the following steps:
Step S101, intercepting voice training data according to the forward sequence of the playing frames and the set intercepting frame length to acquire a forward voice characteristic series; intercepting voice training data according to the reverse sequence of the playing frames and the length of the set intercepting frames to acquire a backward voice characteristic series; the voice training data comprises a plurality of pieces of voice training data with wake-up word voice;
the forward voice characteristic series comprises a plurality of continuous and orderly arranged forward voice characteristic units; the backward voice characteristic series comprises a plurality of backward voice characteristic units which are continuously and orderly arranged;
step S102, a first embedded layer recognition network is established, and the forward voice characteristic series and the backward voice characteristic series are received; the first embedded layer identification network comprises:
a speaker embedded extractor comprising an input layer and hidden layer structure in the TDNN time delay neural network; the speaker embedded extractor recognizes the forward voice feature series and the backward voice feature series through an input layer and a hidden layer in the TDNN time delay neural network to obtain speaker feature recognition weight values of voice training data;
a first pooling layer that pools the speaker characteristic recognition weight values;
a first full-connection layer which is fully connected with the pooled speaker characteristic recognition weight value to acquire an output characteristic value of the speaker embedded layer;
step S103, a second embedded layer recognition network is established and the forward voice characteristic series and the backward voice characteristic series are received; the second embedded layer identification network comprises:
a text embedding extractor which recognizes the forward voice feature series and the backward voice feature series through an input layer and a hidden layer in the TDNN time delay neural network to obtain text feature recognition weight values of voice training data;
a second pooling layer that pools the text feature recognition weight values;
the second full-connection layer is fully connected with the pooled text feature recognition weight value to acquire a text embedding layer output feature value;
step S104, combining the speaker embedded layer output characteristic value and the text embedded layer output characteristic value to obtain a combined voice embedded layer output characteristic;
step S105, establishing a combined network; the current joint recognition weight value is obtained through the output characteristics of the joint voice embedded layer which are fully connected through the combined network;
step S106, judging whether the current joint recognition weight value is a set training weight value, if so, reserving control parameters in the first embedded layer recognition network, the second embedded layer recognition network and the combined network, and if not, fully connecting the current joint recognition weight value to acquire speaker classification weight information and text classification weight information; and respectively adjusting the corresponding first embedded layer identification network and the second embedded layer identification network according to the speaker classification weight information and the text classification weight information.
2. The training method of claim 1, wherein the first embedded layer identification network further comprises: and the first classification full-connection layer is fully connected with the speaker embedding layer output characteristic value to acquire the embedding layer output speaker classification weight information.
3. The training method of claim 2, wherein the second embedded layer identification network further comprises: and the second classification full-connection layer is fully connected with the text embedding layer output characteristic value to acquire the embedding layer output text classification information.
4. A training method as claimed in claim 3, characterized in that said step S106 further comprises:
judging whether the speaker classification weight information output by the embedded layer is smaller than the speaker classification weight information, if yes, adjusting control parameters in the combined network; if not, adjusting the control parameters of the first embedded layer identification network;
judging whether the text classification information output by the embedded layer is smaller than text classification weight information, if so, adjusting control parameters in the combined network; if not, adjusting the control parameters of the second embedded layer identification network.
5. The wake-up word voiceprint recognition model is characterized by being capable of recognizing voice recognition data; intercepting voice recognition data according to the forward sequence of the playing frames and the length of the set intercepting frames to acquire a forward voice characteristic series; intercepting voice recognition data according to the reverse sequence of the playing frames and the length of the set intercepting frames to acquire a backward voice characteristic series; the voice recognition data comprises a plurality of pieces of voice recognition data with wake-up word voice;
the forward voice characteristic series comprises a plurality of continuous and orderly arranged forward voice characteristic units; the backward voice characteristic series comprises a plurality of backward voice characteristic units which are continuously and orderly arranged;
the wake-up word voiceprint recognition model comprises:
a first embedded layer identification network that receives the series of forward speech features and the series of backward speech features; the first embedded layer identification network comprises:
a speaker embedded extractor which recognizes the forward voice feature series and the backward voice feature series through an input layer and a hidden layer in the TDNN time delay neural network to obtain speaker feature recognition weight values of voice recognition data;
a first pooling layer that pools the speaker characteristic recognition weight values; and
a first full-connection layer which is fully connected with the pooled speaker characteristic recognition weight value to acquire an output characteristic value of the speaker embedded layer;
a second embedded layer identification network that receives the series of forward speech features and the series of backward speech features; the second embedded layer identification network comprises:
a text embedding extractor which recognizes the forward voice feature series and the backward voice feature series through an input layer and a hidden layer in the TDNN time delay neural network to obtain text feature recognition weight values of voice recognition data;
a second pooling layer that pools the text feature recognition weight values; and
the second full-connection layer is fully connected with the pooled text feature recognition weight value to acquire a text embedding layer output feature value;
combining the speaker embedded layer output characteristic value and the text embedded layer output characteristic value to obtain a combined voice embedded layer output characteristic; and
the combined network is fully connected with the output characteristics of the combined voice embedded layer to acquire a current combined recognition weight value; and acquiring wake-up word recognition result information according to the current joint recognition weight value.
6. The wake word voiceprint recognition model of claim 5, comprising in the combined network: judging whether the current joint recognition weight value is in a set range, if so, outputting wake-up word recognition passing information; if not, outputting wake-up word recognition failure information.
7. The wake-up word voiceprint recognition method is characterized by comprising the following steps of:
step 201, obtaining voice recognition data;
step 202, obtaining wake-up word recognition result information by recognizing the voice recognition data through the wake-up word voiceprint recognition model according to any one of claims 5 to 6.
CN202011282426.2A 2020-11-17 2020-11-17 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof Active CN112382298B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011282426.2A CN112382298B (en) 2020-11-17 2020-11-17 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011282426.2A CN112382298B (en) 2020-11-17 2020-11-17 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof

Publications (2)

Publication Number Publication Date
CN112382298A CN112382298A (en) 2021-02-19
CN112382298B true CN112382298B (en) 2024-03-08

Family

ID=74584847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011282426.2A Active CN112382298B (en) 2020-11-17 2020-11-17 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof

Country Status (1)

Country Link
CN (1) CN112382298B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360522B (en) * 2022-03-09 2022-08-02 深圳市友杰智新科技有限公司 Training method of voice awakening model, and detection method and equipment of voice false awakening

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
CN108648759A (en) * 2018-05-14 2018-10-12 华南理工大学 A kind of method for recognizing sound-groove that text is unrelated
CN108877812A (en) * 2018-08-16 2018-11-23 桂林电子科技大学 A kind of method for recognizing sound-groove, device and storage medium
CN110120223A (en) * 2019-04-22 2019-08-13 南京硅基智能科技有限公司 A kind of method for recognizing sound-groove based on time-delay neural network TDNN
CN110570870A (en) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 Text-independent voiceprint recognition method, device and equipment
CN110634491A (en) * 2019-10-23 2019-12-31 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN110767231A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Voice control equipment awakening word identification method and device based on time delay neural network
CN111243604A (en) * 2020-01-13 2020-06-05 苏州思必驰信息科技有限公司 Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN111783939A (en) * 2020-05-28 2020-10-16 厦门快商通科技股份有限公司 Voiceprint recognition model training method and device, mobile terminal and storage medium


Also Published As

Publication number Publication date
CN112382298A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
Gomez-Alanis et al. A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization based
US6219639B1 (en) Method and apparatus for recognizing identity of individuals employing synchronized biometrics
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN110299142A (en) A kind of method for recognizing sound-groove and device based on the network integration
Wang et al. A network model of speaker identification with new feature extraction methods and asymmetric BLSTM
Khdier et al. Deep learning algorithms based voiceprint recognition system in noisy environment
Soleymani et al. Prosodic-enhanced siamese convolutional neural networks for cross-device text-independent speaker verification
CN110299132A (en) A kind of speech digit recognition methods and device
Sukhwal et al. Comparative study of different classifiers based speaker recognition system using modified MFCC for noisy environment
CN112382298B (en) Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof
Beigi Challenges of LargeScale Speaker Recognition
CN102496366B (en) Speaker identification method irrelevant with text
CN100570712C (en) Based on anchor model space projection ordinal number quick method for identifying speaker relatively
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
CN111091840A (en) Method for establishing gender identification model and gender identification method
WO1995005656A1 (en) A speaker verification system
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
Wang et al. Capture interspeaker information with a neural network for speaker identification
CN110738985A (en) Cross-modal biometric feature recognition method and system based on voice signals
Novakovic Speaker identification in smart environments with multilayer perceptron
Naveen et al. Speaker Identification and Verification using Deep Learning
Roy et al. Speaker recognition using multimodal biometric system
Chetty et al. Biometric person authentication with liveness detection based on audio-visual fusion
Habib et al. SpeakerNet for Cross-lingual Text-Independent Speaker Verification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant